“Ok Google,” I said to my Honor 9 phone running Android Pie. “How much does the Valve Index retail for?” The Google Assistant app said $999, while displaying a paragraph of text with the cost of the Valve Index, taken from an article on The Verge.
Over the next few days, I experimented with the Google Assistant app, asking it to tell me a joke, whether it thinks it’s conscious (it doesn’t), whether it loves Siri (they get along), and all sorts of informational questions, to which it almost always gave a reasonably good, and often completely accurate, answer.
Not only was I amazed at how well Google Assistant could answer my questions; I was also surprised at how accurately it understood what I was saying. Even when I mumbled or sat in a busy coffee shop, the app seemed to recognize my voice and understand what I said. Clearly, Google has made significant progress with its speech recognition algorithms over the last few years.
This technological progress explains the success of smart speakers such as the Amazon Echo and the Sonos One, both of which use Amazon’s Alexa, widely considered the best virtual assistant on the market right now.
Voice search represents an important shift in computing history, because it flips our relationship with technology around. Before voice, we always had to learn how to interact with the machine through tools such as a keyboard, a mouse, a controller, or a screen.
With voice, however, for the first time, the machine has to learn how to interact with us. It needs to understand our words (speech recognition) and, in some cases, produce an answer in our language, in a way that sounds natural (speech synthesis).
How Does Speech Recognition Work?
To set up voice recognition technology, you’ll need to provide a few voice samples to your device (whether that’s your phone or your smart speaker). The device converts these samples into a digital waveform that is unique to you. Think of your voice as a sound fingerprint: there’s no other voice in the world like it.
In fact, banks have started using voice recognition as a way of securing your account. It’s certainly intuitive, although people still have legitimate concerns about the security of voice authentication. In 2017, for example, the non-identical twin of a BBC reporter was able to access the reporter’s account by mimicking his voice. It took eight tries, but he got in nonetheless. So voice recognition is not an entirely foolproof authentication method quite yet.
Once you’ve set up and activated your device (usually with a passphrase like “hey Siri” or “OK Google”), whatever you say becomes the input to your device. Once again, it will turn the analog sound waves of your voice into a digital waveform, which is basically a string of numbers. That waveform, in turn, becomes a spectrogram, which is broken up into frames that are processed to find the phonemes (the basic units of sound in a spoken language) that each frame contains.
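The framing-and-spectrum step above can be sketched in a few lines of Python. This is a toy illustration using a naive DFT (a real recognizer would use an FFT plus further processing such as mel-scaled filterbanks); the sample rate and frame size here are arbitrary choices, not what any particular assistant uses:

```python
import math

def spectrogram(samples, frame_size=256, hop=128):
    """Split a waveform into overlapping frames and compute each frame's
    magnitude spectrum with a naive DFT (real systems use an FFT)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        spectrum = []
        for k in range(frame_size // 2):  # keep positive frequencies only
            re = sum(x * math.cos(2 * math.pi * k * n / frame_size)
                     for n, x in enumerate(frame))
            im = -sum(x * math.sin(2 * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            spectrum.append(math.hypot(re, im))
        frames.append(spectrum)
    return frames

# A 1 kHz sine sampled at 8 kHz: its energy should land in a single bin.
rate, freq = 8000, 1000.0
wave = [math.sin(2 * math.pi * freq * n / rate) for n in range(1024)]
spec = spectrogram(wave)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(peak_bin * rate / 256)  # 1000.0, the sine's frequency
```

Each inner list here is one “frame” of the spectrogram: a snapshot of which frequencies are present at that moment, which is exactly what the phoneme-finding stage operates on.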
The phonemes found in the frames are then compared to a phonetic dictionary to identify what’s been said. Or, more accurately, to identify what’s probably been said. No speech recognition algorithm will ever be fully accurate.
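As a rough sketch of that dictionary lookup, here is a toy example. The mini lexicon and its ARPAbet-style pronunciations are invented for illustration, and real systems use large pronunciation dictionaries with probabilistic matching rather than plain edit distance:

```python
# Hypothetical mini phonetic dictionary (ARPAbet-style symbols, invented here).
LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
    ("HH", "EH", "L", "P"): "help",
}

def edit_distance(a, b):
    """Classic Levenshtein distance, applied to phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (pa != pb)))
        prev = cur
    return prev[-1]

def best_word(phonemes):
    """Return the pronunciation closest to the recognized phonemes --
    i.e. what has *probably* been said."""
    return min(LEXICON, key=lambda pron: edit_distance(phonemes, pron))

# A noisy recognition: the vowel "AH" was misheard as "AA".
heard = ("HH", "AA", "L", "OW")
print(LEXICON[best_word(heard)])  # hello
```

Even with one misheard phoneme, the closest dictionary entry still wins, which is why the text says “what’s probably been said” rather than “what’s been said.”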
Accuracy is further improved with the help of statistical models such as hidden Markov models and neural networks, which assign probabilities to phonemes depending on their position and on the phonemes surrounding them. Complex stuff, to say the least, but it works, and it’s become remarkably accurate.
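A minimal sketch of that idea, with entirely made-up transition probabilities: a bigram model over phonemes scores competing hypotheses, and the most probable sequence wins. A real hidden Markov model also weighs per-frame acoustic scores and uses dynamic programming (Viterbi decoding), but the principle is the same:

```python
# Toy transition model: P(next phoneme | current phoneme).
# The numbers are invented purely for illustration.
TRANSITIONS = {
    ("K", "AE"): 0.6, ("K", "EH"): 0.1,
    ("AE", "T"): 0.5, ("EH", "T"): 0.3,
}

def sequence_prob(phonemes):
    """Score a phoneme sequence by multiplying its bigram probabilities."""
    p = 1.0
    for bigram in zip(phonemes, phonemes[1:]):
        p *= TRANSITIONS.get(bigram, 0.01)  # small floor for unseen pairs
    return p

# The acoustic model is unsure about the vowel: did the user say
# "cat" (K AE T) or something like "ket" (K EH T)?
candidates = [("K", "AE", "T"), ("K", "EH", "T")]
best = max(candidates, key=sequence_prob)
print(best)  # ('K', 'AE', 'T')
```

The model’s knowledge of which phonemes tend to follow which resolves the ambiguity, exactly the “relationship with surrounding phonemes” the paragraph describes.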
How Does Speech Synthesis Work?
Speech synthesis goes a step further than speech recognition, because now your device needs to talk back too. It begins with effectively the same process as speech recognition: your device creates a voice spectrogram from the analog sound waves of your voice, then maps the phonemes it finds to fragments of that spectrogram.
But then, it needs to talk. Speech synthesis happens mainly in three ways:
Concatenative synthesis, which means tying together (concatenating) short samples of recorded sound stored in a database. The quality of the output will depend on the quality of the phonemes in the database.
Formant synthesis, which means generating the sound frequencies of speech directly. After all, the human voice is exactly that: a combination of sound frequencies. While this technique can sound quite artificial, it’s a good choice for GPS systems, as it helps a device pronounce whatever unusual combination of sounds it encounters (such as a foreign street name).
Articulatory synthesis, which is the most realistic, human-like, and complex form of synthesis. This type of synthesis models the human vocal apparatus and simulates the movements of the speech articulators.
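As a toy illustration of the formant idea, here is a sketch that builds a vowel-like waveform by summing sinusoids at two formant frequencies. The formant values are rough textbook ballpark figures for the vowel /a/, and real formant synthesizers filter a glottal source signal rather than summing pure tones:

```python
import math

def formant_vowel(formants, duration=0.3, rate=8000):
    """Crude formant-style synthesis: sum sinusoids at the vowel's
    formant frequencies, normalized to stay within [-1, 1]."""
    n_samples = int(duration * rate)
    return [sum(math.sin(2 * math.pi * f * n / rate) for f in formants)
            / len(formants)
            for n in range(n_samples)]

# Rough first two formants of /a/ (ballpark figures, ~700 Hz and ~1200 Hz).
samples = formant_vowel([700, 1200])
print(len(samples))  # 2400 samples = 0.3 s at 8 kHz
```

Written to a WAV file, this buzzes rather than speaks, which is precisely why the paragraph above calls formant synthesis artificial-sounding yet flexible: any sound combination can be produced from frequencies alone, with no recorded samples needed.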
Here, too, we’ve made significant progress over the past few years. Increasingly, algorithms can accurately clone someone’s voice based on very little data. For example, here’s the cloned voice of famous podcaster Joe Rogan, trained by listening to his podcasts.
The Ethics of Speech Synthesis
We’ve come to the point where speech technology has advanced so much that it’s worth talking about the ethics of it. Algorithms are now capable enough to realistically simulate someone’s voice using just a few hours of audio (and sometimes much less). Combine this with deepfake algorithms that can create an entirely realistic, but fake video of you, and we’ve reached the point where algorithms can recreate you from some data on the Internet and no one would be able to tell the difference.
This is as dangerous as it is empowering. Your voice is yours alone. As Dwight Schrute from The Office (US) would say: “identity theft is not a joke.” Your voice is part of your identity and should never be used without your explicit permission. Unfortunately, there’s little we can do against deepfake videos and simulated voices right now. Perhaps ironically, we’ll likely end up fighting this trend with algorithms that can identify what’s fake and what’s not.
However, it’s worth nuancing the above too. Every new technology will inevitably be abused in some way or other. Such incidents of abuse are often splashed all over the media, because they make for good stories; we usually hear much less about the benefits of new technology. Voice technology is an enormous step forward. It can make our interaction with technology more intuitive, more inclusive, and, eventually, more secure.
That’s only the start of it, too. We’re moving toward a world where our interaction with technology becomes much more immersive and seamless, where it becomes indistinguishable from magic (as Arthur C. Clarke would say). A screen, or any tool we have to learn before we can interact with a machine, will eventually be seen as an antiquated, old-fashioned way to interact with anything. Voice technology moves us in that direction.