Someone remotely accesses your bank account using a computer-generated voice that sounds indistinguishable from yours. A hacker uses the voice recognition in your Alexa to charge a bunch of Amazon orders to your account. You flip on the TV and see that leaked audio of a political candidate saying something explosive is sweeping across social media, disrupting a country’s presidential election. No one knows it yet, but that recording has been faked too.
Sounds like a dystopian vision of the future. But according to Associate Professor of Electrical and Computer Engineering Hafiz Malik, all of these scenarios are possible today — and are likely to proliferate as voice-based technologies become an even bigger part of our lives.
Malik seems to be among the few who saw it coming. He began working in the then-obscure field of “audio forensics” back in 2010, when Alexa and Google Home were still in the thought experiment stage. The problem, as he defines it, is that big tech companies spent a ton of energy building products that could both understand your voice and emulate human speech. But their eye was on creating a super cool user experience, not on the scary things people could do with the tech.
As a result, many of our voice-based technologies are now vulnerable to attacks. Threats like “replay attacks,” where you play a recording of someone’s voice to fool a piece of technology into thinking they’re actually there, are simple but still effective. (You can try that out yourself using your own phone and Alexa.) Now, though, Malik said, an even more sinister threat comes from “cloned” audio. Using just a few seconds of your real voice and off-the-shelf software, a computer can create a realistic simulation of your voice that will say literally anything.
The implications of the latter are particularly troubling, and not just when it comes to hacked bank accounts or smart speakers. In the political sphere, there’s a whole emerging world of bogus multimedia known as “deep fakes,” in which a computer generates video or audio that is nearly indistinguishable from the real thing. Imagine footage of a political leader giving a speech they never gave. Or, Malik said, a fake phone call from a military commander giving an order to initiate a coup. Some technologies we’re already adopting, like digitally generated newscasters, make a deep fake even easier to pull off.
Malik anticipated potential abuses like this years ago and soon after began researching possible defenses. The first step was determining whether there was, in fact, a reliable way to sort real audio from fake. “Our initial hypothesis was that audio generated through computers, even though it was perceptually identical to human voice-based audio, would look different on the ‘microscopic’ level.” That turned out to be the case: When he subjected real and fake audio to spectral analysis, he found easily detectable markers. The detection method he’s developed has nearly 100 percent accuracy.
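To make the idea concrete, here is a minimal, hypothetical sketch of the kind of spectral comparison that underpins this approach. It is not Malik’s actual method: the file names, the 4 kHz cutoff, and the single energy-ratio feature are illustrative assumptions, and a real detector would combine many such markers with a trained classifier.

```python
# A minimal sketch (not Malik's actual method) of comparing real and
# synthesized speech at the "microscopic" spectral level. The single
# feature here, the fraction of energy above an assumed 4 kHz cutoff,
# is purely illustrative; the file names are placeholders.
from scipy.io import wavfile
from scipy.signal import spectrogram

def spectral_fingerprint(path, cutoff_hz=4000):
    """Return the fraction of spectral energy above cutoff_hz."""
    fs, samples = wavfile.read(path)
    if samples.ndim > 1:                      # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, _, spec = spectrogram(samples.astype(float), fs=fs)
    total = spec.sum()
    return spec[freqs >= cutoff_hz].sum() / total if total > 0 else 0.0

# Hypothetical usage: compare a known-genuine recording with a suspect one.
print("genuine:", spectral_fingerprint("genuine_sample.wav"))
print("suspect:", spectral_fingerprint("suspect_sample.wav"))
```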
But Malik said that’s still only a partial solution.
“In the case of political disinformation, for example, let’s suppose somebody creates audio of a political figure. It will spread almost instantly on social media. By the time law enforcement or the media or the campaign team responds, it’s already out everywhere. In some ways, the damage is already done,” he said. “Though it didn’t involve cloned audio, we saw this recently in India, where people actually died because of disinformation spreading like wildfire over platforms like WhatsApp.”
Because of this, Malik said, a robust defense must not only detect fake multimedia; it must do so in near real time. That’s where Malik is turning his attention now. The latest suite of tools he’s developing would process an audio signal, identify tell-tale signs of computer fakery and render an “authenticity score.” His methods even have implications for spotting fake video, since a speaker’s audio track alone can reveal that the footage is computer-generated.
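As a rough illustration of how an authenticity score might be computed in near real time, the sketch below processes audio in one-second chunks and averages a per-chunk score. Everything here is assumed for the sake of the example: the chunk length, the two toy features (spectral flatness and high-band energy ratio) and the hand-picked, untrained weights bear no relation to Malik’s actual tools.

```python
# Hypothetical sketch of a near-real-time "authenticity score" pipeline:
# score short chunks as they arrive, then aggregate into a single number.
import numpy as np

CHUNK_SECONDS = 1.0     # assumed chunk length for near-real-time scoring
CUTOFF_HZ = 4000        # assumed boundary for the "high-band" feature

def chunk_features(chunk, fs):
    """Toy per-chunk features: spectral flatness and high-band energy ratio."""
    spectrum = np.abs(np.fft.rfft(chunk)) + 1e-12
    freqs = np.fft.rfftfreq(len(chunk), d=1.0 / fs)
    flatness = np.exp(np.mean(np.log(spectrum))) / np.mean(spectrum)
    high_ratio = spectrum[freqs >= CUTOFF_HZ].sum() / spectrum.sum()
    return np.array([flatness, high_ratio])

def authenticity_score(samples, fs, weights=(-3.0, 5.0), bias=0.0):
    """Score each chunk with a hypothetical, untrained linear model, then
    average; closer to 1.0 means the audio looks more like genuine speech."""
    w = np.asarray(weights)
    step = int(CHUNK_SECONDS * fs)
    scores = []
    for start in range(0, len(samples) - step + 1, step):
        feats = chunk_features(samples[start:start + step], fs)
        scores.append(1.0 / (1.0 + np.exp(-(w @ feats + bias))))
    return float(np.mean(scores)) if scores else 0.0
```

In practice, a deployed detector would replace the hand-picked weights with a model trained on labeled genuine and cloned audio, and would run the same chunked loop on a live stream rather than a finished file.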
A tool like that could be deployed in any number of ways, Malik said. Newsrooms could use it to digitally “fact check” leaked audio from an anonymous source. Social media platforms could integrate these detection tools directly into their architecture — automatically processing multimedia and alerting users of suspicious content. And Alexa or Google Home could judge whether it’s really you who’s ordering that new gadget — or just a computer that sounds like you.
Even then, Malik said, we’re still in for a wild ride.
“As with most security issues, it’s a game of back and forth,” he said. “An attacker develops a strategy, and you defend against it. Then the attacker figures out your defense strategy, so you have to update your defense. But I think we are very close to a place where seeing is no longer believing. I already feel that way, personally. I don’t assume something is authentic unless I can verify it from multiple sources. It’s scary, but it’s the world we’re living in.”