In 2016, Aäron van den Oord had just won an award for his research in image generation when he was struck by an idea. If his technique could learn to predict a two-dimensional sequence of pixels, could it also learn to predict a waveform and thus generate realistic voices? The idea was intriguing but seemed like a long shot. His manager at DeepMind, an AI research subsidiary of Google, gave him two weeks to try it out, saying that if it didn’t work, he should move on to something else.
The results beat everyone’s expectations. Within two weeks, van den Oord had a prototype. Within three months, it was generating more realistic voices than any existing system. Within another year, Google had begun using WaveNet, as the system came to be called, to generate voices for Google Assistant.
WaveNet now powers 51 voices as well as Duplex, Google’s newest voice assistant, which calls salons and restaurants on behalf of users to book appointments or reserve tables. The results are startlingly realistic. When Google CEO Sundar Pichai first demoed Duplex in 2018, with all its human-like “umms” and “ahs,” it set a new bar for what is possible when people communicate with machines.
Voice assistants need to do more than just generate a synthetic voice; they also have to recognize when someone is talking and understand what is being said, each a challenge unto itself. Still, researchers have long sought to create the right artificial voice for natural and engaging conversations. “There’s a lot of meaning in a voice,” says van den Oord.