Researchers at Alphabet (Google's parent company) have worked for over a decade on text-to-speech (TTS), trying to make synthesized speech sound like a real human. Building on ideas from earlier work such as Tacotron and WaveNet, they have made further improvements to arrive at their latest system, Tacotron 2. It generates human-like speech from text using neural networks trained only on speech examples and their corresponding text transcripts.
How does Tacotron 2 work?
Tacotron 2 uses a sequence-to-sequence model optimized for TTS to map a sequence of letters to a sequence of features that encode the audio. It consists of two deep neural networks. As the published research paper describes it, the first translates text into a spectrogram, a visual representation of how the audio's frequency content changes over time. The second, DeepMind's WaveNet, takes that spectrogram and generates the corresponding audio. The result is an end-to-end engine that can emphasize words, pronounce names correctly, and pick up on cues in the text, such as stressing words that are italicized or CAPITALIZED, and alter its pronunciation accordingly.
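To make the idea of a spectrogram more concrete, here is a minimal sketch in plain Python that turns a test tone into a grid of frequency magnitudes over time. This is not Tacotron 2's code: the function name, frame length, hop size, and sample rate are all assumptions chosen for illustration, and the actual system predicts mel-scaled spectrograms rather than the plain linear one shown here.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)
FRAME_LEN = 256     # samples per analysis frame (illustrative value)
HOP = 128           # samples between the starts of consecutive frames


def spectrogram(signal, frame_len=FRAME_LEN, hop=HOP):
    """Return one list of magnitudes per frame, for bins 0..frame_len // 2."""
    # A Hann window tapers each frame to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_len)]
        magnitudes = []
        for k in range(frame_len // 2 + 1):
            # Direct discrete Fourier transform of the windowed frame.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# A pure 1000 Hz tone stands in for recorded speech.
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(1024)]
spec = spectrogram(tone)

# Find the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN
print(len(spec), "frames; strongest frequency in the first frame:", peak_hz, "Hz")
```

Each row of `spec` is one slice of time and each column is one frequency band. In Tacotron 2 the first network learns to predict a grid like this from text, and the second network turns the grid into a waveform.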
You can listen to audio samples generated by this advanced TTS system. In a detailed evaluation, the results were found to be very close to real human speech. Tacotron 2 achieved a mean opinion score (MOS) of 4.53, the best so far, compared to a MOS of 4.58 for professionally recorded speech!
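A mean opinion score is simply the average of the ratings that human listeners give to audio clips, typically on a scale from 1 (bad) to 5 (excellent). The sketch below shows the arithmetic along with an approximate 95% confidence interval. The ratings are made up for illustration and are not data from the Tacotron 2 evaluation, and the function name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating {rating} is outside the 1 to 5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Made-up listener ratings for a single system (illustration only).
ratings = [5, 4, 5, 4.5, 4, 5, 4.5, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.3f} +/- {half_width:.3f}")
```

With only a handful of ratings the interval is wide, which is why a small gap such as 4.53 versus 4.58 only means something when many ratings have been collected.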
Major hurdles faced by Tacotron 2
Though the samples are great, Tacotron 2 still faces some hurdles:
- Some complex words are difficult for it to pronounce
- Random noise is generated at times
- Real-time audio generation is not yet functional
- The generated speech cannot be controlled. For example, it cannot be made to sound happy or sad to match a required tone.
Considering the time and money invested, the launch of the first public beta doesn't seem that far away. We look forward to seeing when Tacotron 2 becomes available to beta testers. Subscribe to find out when the engine is ready for launch.