
Artificial intelligence & robotics

Eunwoo Song

Introducing a speaker-adaptive training method for the neural TTS system in Korea.

Year Honored



As deep neural networks have made acoustic modeling more accurate, the quality of synthetic speech from text-to-speech (TTS) systems has improved significantly. However, these systems still have a major shortcoming: they require a large amount of training data to learn the complex nature of speech production. Typically, a conventional unit-selection TTS system requires more than 100 hours of human recordings to provide faithful, human-like voices for real-world applications such as AI speakers and audiobooks.

At NAVER Corp., one of the biggest IT companies in Korea, Eunwoo Song was the first to introduce a speaker-adaptive training method for the neural TTS system. His contribution reduced the minimum amount of voice recordings required from 100 hours to between 1 and 4 hours, which enabled the mass production of TTS voices. As a result, NAVER has been able to provide high-quality synthetic voices to customers of various NAVER services such as its AI speaker, GPS navigation, and Papago translation. For instance, the Clova Dubbing service can combine more than 100 TTS voices with video clips, and it has assisted teachers in preparing online classes during the pandemic. It is now even possible to make a personalized TTS voice from smartphone recordings. In particular, he and his team members launched a campaign called “Mother’s Voice” that will create personalized voices for 100 users who have heartwarming stories about their family members.

Besides his work at NAVER Corp., he also shares his insights on adaptive TTS techniques. Through lectures (at Seoul National University, KAIST, Yonsei University, Korea University, etc.) and more than 30 paper presentations, he contributes to joint growth with speech researchers from both industry and academia.