You just can’t trust what you hear.
That’s one of the many emerging terrors of our brave new funhouse-mirror dystopia of e-scams, cat-phishing, and democracy-destroying propaganda. In the good old days, only a few extremely skilled vocal impersonators could fake someone’s voice well enough to fool people familiar with the real thing. And so back then, the chances of Rich Little (look him up), Jay Pharoah, or Bill Hader targeting you in the middle of the night with a call from “your brother” asking you to transfer thousands of dollars in bail money were pretty much zero.
But now, thanks to AI voice emulators all over the internet, almost anyone online can commit audio fraud in minutes (and no, that sentence is clearly not encouragement to do so).
Fortunately for all the people who want to keep their money and the integrity of their elections safe from cybercriminals and political saboteurs, there’s been a breakthrough. Named Rehearsal with Auxiliary-Informed Sampling, or RAIS, the new system distinguishes real from faked voices and “maintains performance over time as attack types evolve.”
RAIS to the top
As Falih Gozi Febrinanto and his co-authors discuss in their paper “Rehearsal with Auxiliary-Informed Sampling for Audio Deepfake Detection,” existing detectors are failing against the latest deepfakes. That’s why RAIS is so important. Through rehearsal-based continual learning, RAIS “updates models using a limited set of old data samples” and “helps preserve prior knowledge while incorporating new information.”
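To make the rehearsal idea concrete, here is a minimal sketch, assuming a stand-in detector (a tiny classifier over placeholder features, not the authors’ architecture): each new “experience” of deepfake data is used for fine-tuning alongside a small replayed buffer of older samples, so learning the new fakes doesn’t erase the old ones.

```python
# Minimal sketch of rehearsal-based continual learning (not the authors' code).
# When a new batch of deepfake data arrives, the detector is fine-tuned on it
# mixed with a small buffer of previously seen samples, so old knowledge survives.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for an audio deepfake detector: a tiny classifier over
# pre-extracted 64-dimensional acoustic features (the real system is far larger).
detector = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(detector.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

memory_x, memory_y = [], []   # the small rehearsal buffer of past samples
BUFFER_SIZE = 200

def train_on_experience(new_x, new_y, epochs=3):
    """Fine-tune on a new 'experience', replaying buffered old samples alongside it."""
    for _ in range(epochs):
        if memory_x:  # mix the new experience with rehearsed old samples
            x = torch.cat([new_x, torch.stack(memory_x)])
            y = torch.cat([new_y, torch.tensor(memory_y)])
        else:
            x, y = new_x, new_y
        optimizer.zero_grad()
        loss = loss_fn(detector(x), y)
        loss.backward()
        optimizer.step()
    # Naive buffer update: keep whatever fits (RAIS instead selects for diversity).
    for xi, yi in zip(new_x, new_y):
        if len(memory_x) < BUFFER_SIZE:
            memory_x.append(xi.detach())
            memory_y.append(int(yi))

# Toy usage: two successive "experiences" built from placeholder features,
# standing in for waves of deepfakes from different generators.
for _ in range(2):
    fake = torch.randn(50, 64) + 1.0
    real = torch.randn(50, 64) - 1.0
    x = torch.cat([fake, real])
    y = torch.cat([torch.ones(50, dtype=torch.long), torch.zeros(50, dtype=torch.long)])
    train_on_experience(x, y)
```

The crucial question, which the rest of the work addresses, is which past samples deserve one of the buffer’s scarce slots.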
Presented at Interspeech, the leading global gathering on spoken language processing science and technology, the paper explores how Febrinanto and fellow researchers at Australia’s national science agency CSIRO, Federation University Australia, and the Royal Melbourne Institute of Technology have created a brand-new weapon in the fight against digital-audio deception, which is weaponized to bypass “voice-based biometric authentication systems” and to fuel impersonation and disinformation.
Superior audio labelling beats catastrophic forgetting
Because the defense must evolve as relentlessly as the threat, co-author Kristen Moore said that she and her colleagues want “detection systems to learn the new deepfakes without having to train the model again from scratch. If you just fine-tune on the new samples, it will cause the model to forget the older deepfakes it knew before.”
Current rehearsal techniques simply aren’t supple enough to capture just how varied the range of human voices – or even the range of one human’s voice – can be. And that lack of sophistication introduces bias and increases the likelihood of the model discarding critical information during new training, as Moore described.
Therefore, RAIS “employs a label generation network to produce auxiliary labels, guiding diverse sample selection for the memory buffer.” The result is superior fakery detection, “achieving an average Equal Error Rate (EER) of 1.953% across five experiences.” The EER is the point at which a verification system’s false-acceptance rate equals its false-rejection rate: the lower the EER, the more reliable the system that produced it. The RAIS code, which is highly effective despite using only a small memory buffer, is available on GitHub.
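For readers who want to see what that headline number measures, here is an illustrative EER calculation – a generic metric sketch, not the RAIS evaluation code – which sweeps the detector’s scores over thresholds until the false-acceptance and false-rejection rates meet.

```python
# Illustrative calculation of an Equal Error Rate (EER) from detector scores;
# a generic metric computation, not the RAIS evaluation code.
import numpy as np

def equal_error_rate(scores, labels):
    """scores: higher = more likely fake; labels: 1 = fake, 0 = genuine."""
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        predicted_fake = scores >= t
        fars.append(np.mean(predicted_fake[labels == 0]))   # genuine clips wrongly flagged
        frrs.append(np.mean(~predicted_fake[labels == 1]))  # fake clips wrongly accepted
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))  # threshold where the two error rates meet
    return (fars[i] + frrs[i]) / 2

# Toy example: well-separated scores yield a low EER.
scores = np.array([0.9, 0.8, 0.85, 0.2, 0.1, 0.3])
labels = np.array([1, 1, 1, 0, 0, 0])
print(f"EER = {equal_error_rate(scores, labels):.1%}")  # -> EER = 0.0%
```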
RAIS’s solution, says Moore, automatically selects and stores “a small, but diverse set of past examples, including hidden audio traits that humans may not even notice.” Rather than labelling audio samples with the simple binary of “fake” and “authentic,” RAIS uses a more descriptive set of labels, and by retaining and rehearsing with these labelled samples, the model can “help the AI learn the new deepfake styles without forgetting the old ones” and ensure “a richer mix of training data, improving its ability to remember and adapt over time.”
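Here is a rough sketch of that sampling idea, with one loud caveat: the auxiliary labels below are invented placeholders, whereas RAIS learns its labels with a label generation network. The point is simply that spreading the memory buffer across auxiliary-label groups, rather than sampling at random, keeps the replayed data diverse.

```python
# Hedged sketch of auxiliary-informed buffer selection: group past clips by an
# auxiliary label and fill the buffer round-robin across groups, so no single
# style of voice or fake dominates. The auxiliary labels here are hypothetical
# placeholders, not the learned labels RAIS actually produces.
import random
from collections import defaultdict

def diverse_buffer(samples, buffer_size):
    """samples: list of (clip_id, binary_label, auxiliary_label) tuples."""
    by_aux = defaultdict(list)
    for sample in samples:
        by_aux[sample[2]].append(sample)      # group clips by auxiliary label
    groups = list(by_aux.values())
    for group in groups:
        random.shuffle(group)
    buffer = []
    # Round-robin over auxiliary-label groups until the buffer is full.
    while len(buffer) < buffer_size and any(groups):
        for group in groups:
            if group and len(buffer) < buffer_size:
                buffer.append(group.pop())
    return buffer

# Toy usage: clips tagged fake (1) or real (0), plus an invented auxiliary trait.
samples = [(i, i % 2, f"trait_{i % 5}") for i in range(100)]
print(diverse_buffer(samples, buffer_size=10))
```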
The deepfake threat is deeply real, and deeply global
Just as AI videos crawling across our social media feeds have become so much more believable that even skeptical people are fooled (mea culpa – just today I shared a video of a toddler convincing a puppy to stop barking, only to find another video moments later with a different toddler saying the exact same words in the same voice to a different puppy – and yes, I deleted it), the best new AI audio deepfakes no longer speak with bizarre cadences and weird stresses on the wrong syllables or the wrong words.
That new level of credibility is far more dangerous than the old-fashioned propagandist’s text-only gambit of deliberately misquoting or even inventing entire sentences for the mouths of one’s enemies.
That’s because, as AICompetence reported, “Studies show that AI-cloned voices can trigger stronger emotional responses than text-based misinformation. When a trusted voice sounds real, critical thinking pauses. That’s why synthetic audio, such as the deepfaked Biden robocall that urged New Hampshire voters not to cast ballots in the 2024 US presidential election, poses such unique danger. If a familiar voice told you not to vote, would you pause to verify it?”
Other high-profile audio deepfake cases include that of Mark Read, CEO of WPP, the world’s biggest advertising firm. Fraudsters set up a Microsoft Teams account using a real photograph of him, then used Read’s deepfaked voice during a Teams meeting in an unsuccessful attempt to establish a new business venture and extract money and sensitive personal information. Scammers were more successful in Italy, where they deepfaked the voice of the country’s Minister of Defense to demand a €1M “ransom” from prominent business leaders. And some of them paid.
And just as deepfakers targeted Joe Biden and his supporters, Elon Musk reposted, without context, a deepfake-altered and deeply defamatory political advertisement featuring then-US Vice President Kamala Harris, violating the rules of the very platform he owned. Silicon scammers have launched attacks against electoral integrity in countries such as Bangladesh, Hungary, and Slovakia. During Slovakia’s 2023 parliamentary election, cyberfraudsters posted phony audio clips of opposition leader Michal Šimečka allegedly plotting election fraud. Those clips propagated virally mere days before citizens marked their ballots.
As AICompetence explains, “The danger isn’t only in the lies themselves – it’s in how they undermine trust in everything genuine.” As more people understand what deepfakes are, “politicians may claim that authentic scandals are AI fabrications. Public awareness alone, without media literacy, can paradoxically amplify disinformation’s reach.”
And as Danielle Citron, the law professor who co-authored the seminal paper “Deep Fakes: A Looming Challenge for Privacy, Democracy, and National Security,” succinctly and chillingly summarised on AICompetence.org, “The real threat of deepfakes isn’t just that people will believe what’s false – it’s that they’ll stop believing what’s true.” There’s a term for this assault on truth itself: the liar’s dividend.
New Atlas has previously reported on the crisis in deepfakes, including the case of Microsoft Research Asia revealing “an AI model that can generate frighteningly realistic deepfake videos from a single still image and an audio track,” and the intensely disturbing experimental finding that 49% of participants “readily formed false memories” by believing that deepfakes of famous movies were real.
But New Atlas has also covered powerful new detection solutions in the fight for truth, as with AntiFake, a 2023 innovation from Washington University in St. Louis that may be one of the first tools to stop deepfakery before it can start, by “making it much harder for AI systems to read the crucial vocal characteristics in recordings of real people’s voices.”
Source: CSIRO

