Tacotron 2: Generating Human-Like Speech from Text

Tacotron 2: Google simplifies the process of teaching AI to speak like a human

Developing the perfect language translation tool has long been a difficult challenge for scientists, researchers and entrepreneurs alike. Google has made headway with its translation services, but when it comes to AI-generated speech, almost every system still sounds robotic and is easily told apart from a human voice. With Tacotron 2, Google has developed a new method that helps developers train a neural network to produce realistic speech.

Tacotron 2 should help bring realistic speech to translators by analysing the text alone, without the need for any grammatical expertise in the language in question. The method combines two earlier speech generation projects: the original Tacotron and WaveNet.

The earlier voice generators

In its early days, WaveNet left everyone speechless by producing eerily convincing speech, generated one audio sample at a time. This was a great achievement, but on its own it was not practical for language translation or text-to-speech. To generate a voice, WaveNet needed a great deal of metadata about every aspect of the language, from correct pronunciation to key linguistic features. The first Tacotron overcame this problem by working out the high-level linguistic features itself, but its output was not good enough to be used directly in the speech products of the time.

How Tacotron 2 works

Tacotron 2 makes use of both text and its narration to learn how a particular language is spoken by native speakers. In short, it works out the linguistic rules that apply to a given text in order to render a human-like voice. In essence, the method converts the text into a Tacotron-style mel-scale spectrogram, which captures rhythm and emphasis. The words themselves are then formed using a WaveNet-style system for a more realistic result. The resulting speech is extremely convincing as a human voice, though it tends to sound a little more chipper than usual.
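To give a feel for what "mel-scale" means, here is a minimal Python sketch. It is not Google's code. It uses a widely used textbook formula for converting frequency in hertz to mels, and the band count and frequency range (80 bands between 125 Hz and 7600 Hz) are illustrative values assumed here, not figures taken from this article. The function names are made up for the example.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595 * log10(1 + f/700) formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies (Hz) of n_bands bands spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_bands)]


# Assumed, illustrative settings: 80 bands from 125 Hz to 7600 Hz.
centers = mel_band_centers(n_bands=80, f_min=125.0, f_max=7600.0)
print("lowest band centers (Hz): ", [round(c) for c in centers[:3]])
print("highest band centers (Hz):", [round(c) for c in centers[-3:]])
```

The bands sit close together at low frequencies and spread out at high ones, roughly matching how human hearing resolves pitch. A mel-scale spectrogram records how much energy falls into each such band over time, which is the kind of intermediate picture of the speech that the WaveNet-style stage then turns into audio.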

Overcoming the challenges with Tacotron 2

Tacotron 2 still has a number of shortcomings to overcome before it can deliver a human-like voice for language translation. First, the researchers say it has great difficulty pronouncing tricky words such as "decorum" and "merlot", and in some cases it freaks out and generates eerie, strange noises. Second, it cannot yet generate speech in real time. Third, the researchers cannot control the tone of the generated voice or make it sound happy or sad. The team is continuing its research to overcome these challenges and bring the world's first realistic voice generator to the masses.
