Pipe Down Before Panicking Over Voice Resonance Alteration

(Part of the biometric product marketing expert series)

[Image credit: Steve Tan (steve.tan@pvc4pipes.com), http://www.pvc4pipes.com, Attribution, via Wikimedia Commons, https://commons.wikimedia.org/w/index.php?curid=22089684]

On the surface, it sounds scary. Tricking automated speaker identification systems with PVC pipe?

(D)igital security engineers at the University of Wisconsin–Madison have found these systems are not quite as foolproof when it comes to a novel analog attack. They found that speaking through customized PVC pipes — the type found at most hardware stores — can trick machine learning algorithms that support automatic speaker identification systems.

From https://news.wisc.edu/down-the-tubes-common-pvc-pipes-can-hack-voice-identification-systems/

So how does the trick work?

The project began when the team began probing automatic speaker identification systems for weaknesses. When they spoke clearly, the models behaved as advertised. But when they spoke through their hands or talked into a box instead of speaking clearly, the models did not behave as expected.

(Shimaa) Ahmed investigated whether it was possible to alter the resonance, or specific frequency vibrations, of a voice to defeat the security system. Because her work began while she was stuck at home due to COVID-19, Ahmed began by speaking through paper towel tubes to test the idea. Later, after returning to the lab, the group hired Yash Wani, then an undergraduate and now a PhD student, to help modify PVC pipes at the UW Makerspace. Using various diameters of pipe purchased at a local hardware store, Ahmed, Wani and their team altered the length and diameter of the pipes until they could produce the same resonance as the voice they were attempting to imitate.

Eventually, the team developed an algorithm that can calculate the PVC pipe dimensions needed to transform the resonance of almost any voice to imitate another. In fact, the researchers successfully fooled the security systems with the PVC tube attack 60 percent of the time in a test set of 91 voices, while unaltered human impersonators were able to fool the systems only 6 percent of the time.
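The quoted passage says the attack boils down to matching a pipe's acoustic resonance to a target voice. To see why pipe length maps to resonance at all, here is a toy sketch using the textbook open-pipe resonator formula. This is not the researchers' published algorithm (which accounts for diameter, end corrections, and the full frequency response); it only illustrates the underlying physics.

```python
# Toy sketch: the textbook relation between the length of a pipe open at
# both ends and its resonant frequencies, f_n = n * v / (2L). This is an
# illustration of the physical principle, NOT the UW-Madison algorithm.

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def open_pipe_resonances(length_m, n_modes=3):
    """Resonant frequencies (Hz) of a pipe open at both ends."""
    return [n * SPEED_OF_SOUND / (2 * length_m) for n in range(1, n_modes + 1)]

def pipe_length_for_formant(target_hz):
    """Pipe length (m) whose fundamental matches a target frequency: L = v / (2f)."""
    return SPEED_OF_SOUND / (2 * target_hz)

# Example: to reinforce a ~700 Hz resonance, the pipe would need to be
# roughly a quarter of a metre long.
length = pipe_length_for_formant(700.0)
print(f"{length:.3f} m")             # 0.245 m
print(open_pipe_resonances(length))  # [700.0, 1400.0, 2100.0]
```

Even this toy version shows the key point: small changes in length shift the resonance substantially, which is why the team could tune cheap hardware-store pipe toward a target voice's resonance.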

From https://news.wisc.edu/down-the-tubes-common-pvc-pipes-can-hack-voice-identification-systems/

Impressive results. But…

Who was fooled?

We’ve run across sweeping biometric claims like this before, most notably in the first study that asserted face categorization algorithms were racist and sexist. (Face categorization, not face recognition. That’s another story.) If you never visited the Gender Shades website, you’d assume that the hundreds of existing face categorization algorithms had just been proven to be racist and sexist. But if you read the Gender Shades study itself, you’ll see that it tested only three algorithms (IBM, Microsoft, and Face++). Similarly, the Master Faces study looked at only three algorithms (Dlib, FaceNet, and SphereFace).

So let’s ask the question: which voice algorithms did UW-Madison test?

Here’s what the study (PDF) says.

We evaluate two state-of-the-art ASI models: (1) the x-vector network [51] implemented by Shamsabadi et al. [45], and (2) the emphasized channel attention, propagation and aggregation time delay neural network (ECAPA-TDNN) [17], implemented by SpeechBrain. Both models were trained on VoxCeleb dataset [15, 36, 37], a benchmark dataset for ASI. The x-vector network is trained on 250 speakers using 8 kHz sampling rate. ECAPA-TDNN is trained on 7205 speakers using 16 kHz sampling rate. Both models report a test accuracy within 98–99%.

From https://www.usenix.org/system/files/sec23fall-prepub-452-ahmed.pdf
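Both x-vector and ECAPA-TDNN map an utterance to a fixed-length speaker embedding, and verification is typically done by comparing embeddings with cosine similarity against a tuned threshold. Here is a minimal sketch of that decision step; the vectors and the 0.7 threshold are invented for illustration (real embeddings are hundreds of dimensions, and real thresholds are calibrated per system).

```python
# Minimal sketch of embedding-based speaker verification: compare an
# enrolled speaker's embedding with a probe utterance's embedding using
# cosine similarity. The vectors and threshold are toy values.
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(enrolled, probe, threshold=0.7):
    """Accept the probe as the enrolled speaker if similarity clears the threshold."""
    return cosine_similarity(enrolled, probe) >= threshold

enrolled = np.array([0.9, 0.1, 0.4])    # enrolled speaker's embedding (toy)
genuine  = np.array([0.8, 0.2, 0.5])    # same speaker, new utterance (toy)
impostor = np.array([-0.3, 0.9, 0.1])   # different speaker (toy)

print(verify(enrolled, genuine))   # True
print(verify(enrolled, impostor))  # False
```

The PVC attack matters precisely because it works on the acoustic signal before the embedding is computed: if the pipe shifts the voice's resonance toward the target speaker, the resulting embedding drifts across that threshold without any digital tampering.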

So what we know is that this test, which used these two ASI models trained on a particular dataset, demonstrated an ability to fool those two systems 60 percent of the time.

But…

  • What does this mean for other ASI algorithms, including the commercial algorithms in use today?
  • And what does it mean when other datasets are used?

In other words (and I’m adapting my own text here), how do the results of this study affect “current automatic speaker identification products”?

The answer is “We don’t know.”

So pipe down…until we actually test commercial algorithms against this technique.

But I’m sure that the UW-Madison researchers and I agree on one thing: more research is needed.
