Baidu, China’s leading internet search company, announced that it has developed an AI called “Deep Voice” – a text-to-speech system that can synthesize audio in real time, producing speech in fractions of a second.
In the paper entitled “Deep Voice: Real-time Neural Text-to-Speech,” Baidu researchers wrote that Deep Voice differs from WaveNet, Google’s text-to-speech system, and other text-to-speech systems such as SampleRNN and Char2Wav in that it can “synthesize audio in fractions of a second, and offers a tunable trade-off between synthesis speed and audio quality.”
“In this work, we demonstrate that current Deep Learning approaches are viable for all the components of a high quality text-to-speech engine by building a fully neural system. Our system is trainable without any human involvement, dramatically simplifying the process of creating TTS (text-to-speech) systems,” Baidu researchers said.
Compared to other text-to-speech systems, according to Baidu researchers, Deep Voice is completely standalone: training a new Deep Voice system does not require a pre-existing text-to-speech system and can be done using a dataset of short audio clips and corresponding text transcripts.
Unlike other systems, which rely on hand-engineered features such as the spectral envelope, spectral parameters, and aperiodic parameters, Deep Voice’s only features are “phonemes with stress annotations, phoneme durations, and fundamental frequency (F0),” Baidu researchers said. They said this choice of features makes Deep Voice more readily applicable to new voices, datasets, and domains without any human data annotation or further feature engineering.
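To make the feature set concrete, here is a minimal Python sketch of what a per-phoneme record holding those three features might look like. This is purely illustrative: the class name, fields, and values are hypothetical and are not taken from the Deep Voice codebase.

```python
# Illustrative sketch only: the three feature types the researchers describe
# (phonemes with stress annotations, durations, and fundamental frequency F0).
# All names and values here are made up for illustration.
from dataclasses import dataclass

@dataclass
class PhonemeFeature:
    phoneme: str        # e.g. an ARPAbet symbol; vowels carry a stress digit
    duration_ms: float  # phoneme duration in milliseconds
    f0_hz: float        # fundamental frequency (0.0 for unvoiced phonemes)

# The word "hello" (HH AH0 L OW1 in ARPAbet), with hypothetical values:
hello = [
    PhonemeFeature("HH", 60.0, 0.0),     # unvoiced, so F0 is zero
    PhonemeFeature("AH0", 80.0, 120.0),  # unstressed vowel
    PhonemeFeature("L", 70.0, 118.0),
    PhonemeFeature("OW1", 150.0, 125.0), # primary-stressed vowel
]

# Total predicted duration of the utterance:
total_ms = sum(p.duration_ms for p in hello)
print(total_ms)  # 360.0
```

Because none of these fields require a human to mark up spectra by hand, a representation like this can, as the researchers note, be produced for new voices and datasets without manual annotation.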
Text-to-speech systems are particularly important in navigation systems, speech-enabled devices, and accessibility tools for visually impaired individuals.