Faulty “journalism” conclusions: the Israeli “master faces” study DIDN’T test ANY commercial biometric algorithms

Modern “journalism” often consists of reprinting a press release without subjecting it to critical analysis. Sadly, I see a lot of this in both biometric and technology publications.

This post looks at the recently announced master faces study results, the datasets used (and the datasets not used), the algorithms used (and the algorithms not used), and the (faulty) conclusions that have been derived from the study.

Oh, and it also informs you of a way to make sure that you don’t make the same mistakes when talking about biometrics.

Vulnerabilities from master faces

In facial recognition, there is a concept called “master faces” (similar concepts exist for other biometric modalities). The idea is that a single “master” face can potentially match against MULTIPLE faces, not just one. This is similar to a master key that can unlock many doors, not just one.

This can conceivably happen because facial recognition algorithms do not compare faces to faces; they compare features derived from one face to features derived from another. So if you can create the right “master” feature set, it can potentially match more than one face.
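
To make the feature-matching intuition concrete, here is a minimal sketch. This is my own illustration, not the Israeli researchers’ method: the embeddings are random stand-ins for real face templates, and the 0.45 similarity threshold is a hypothetical value, not one used by dlib, FaceNet, SphereFace, or any actual system.

```python
# A toy illustration of the "master feature set" idea. The embeddings
# below are random stand-ins for real face templates, and the threshold
# is hypothetical; no actual facial recognition system is modeled here.
import numpy as np

THRESHOLD = 0.45  # hypothetical acceptance threshold
rng = np.random.default_rng(42)

def random_embedding(dim: int = 128) -> np.ndarray:
    """Stand-in for the feature vector an algorithm derives from a face."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def is_match(probe: np.ndarray, enrolled: np.ndarray) -> bool:
    """Declare a match if cosine similarity clears the threshold."""
    return float(probe @ enrolled) >= THRESHOLD

# Ten enrolled identities, each reduced to a derived feature vector.
gallery = [random_embedding() for _ in range(10)]

# A "master" probe need not equal any single enrolled vector; it only
# needs to land close enough to several of them at once. Here we build
# one by averaging three enrolled vectors and renormalizing.
master = sum(gallery[:3])
master = master / np.linalg.norm(master)

hits = sum(is_match(master, e) for e in gallery)
print(f"One master probe matched {hits} of {len(gallery)} identities")
```

The specific numbers don’t matter; the point is that matching happens in feature space, so a single well-placed feature vector can clear the acceptance threshold for multiple identities at once.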

However, this is not just a concept. It’s been done, as Biometric Update informs us in an article entitled ‘Master faces’ make authentication ‘extremely vulnerable’ — researchers.

Ever thought you were being gaslighted by industry claims that facial recognition is trustworthy for authentication and identification? You have been.

The article goes on to discuss an Israeli research project that demonstrated some true “master faces” vulnerabilities. (Emphasis mine.)

One particular approach, which they write was based on Dlib, created nine master faces that unlocked 42 percent to 64 percent of a test dataset. The team also evaluated its work using the FaceNet and SphereFace, which like Dlib, are convolutional neural network-based face descriptors.

They say a single face passed for 20 percent of identities in Labeled Faces in the Wild, an open-source database developed by the University of Massachusetts. That might make many current facial recognition products and strategies obsolete.

Sounds frightening. After all, the study not only used dlib, FaceNet, and SphereFace, but also made reference to a test set from Labeled Faces in the Wild. So it’s obvious why master faces techniques might make many current facial recognition products obsolete.

Right?

Let’s look at the datasets

It’s always more impressive to cite an authority, and citations of the University of Massachusetts’ Labeled Faces in the Wild (LFW) are no exception. After all, this dataset has been used for some time to evaluate facial recognition algorithms.

But what does Labeled Faces in the Wild say about…itself? (I know this is a long excerpt, but it’s important.)

DISCLAIMER:

Labeled Faces in the Wild is a public benchmark for face verification, also known as pair matching. No matter what the performance of an algorithm on LFW, it should not be used to conclude that an algorithm is suitable for any commercial purpose. There are many reasons for this. Here is a non-exhaustive list:

Face verification and other forms of face recognition are very different problems. For example, it is very difficult to extrapolate from performance on verification to performance on 1:N recognition.

Many groups are not well represented in LFW. For example, there are very few children, no babies, very few people over the age of 80, and a relatively small proportion of women. In addition, many ethnicities have very minor representation or none at all.

While theoretically LFW could be used to assess performance for certain subgroups, the database was not designed to have enough data for strong statistical conclusions about subgroups. Simply put, LFW is not large enough to provide evidence that a particular piece of software has been thoroughly tested.

Additional conditions, such as poor lighting, extreme pose, strong occlusions, low resolution, and other important factors do not constitute a major part of LFW. These are important areas of evaluation, especially for algorithms designed to recognize images “in the wild”.

For all of these reasons, we would like to emphasize that LFW was published to help the research community make advances in face verification, not to provide a thorough vetting of commercial algorithms before deployment.

While there are many resources available for assessing face recognition algorithms, such as the Face Recognition Vendor Tests run by the USA National Institute of Standards and Technology (NIST), the understanding of how to best test face recognition algorithms for commercial use is a rapidly evolving area. Some of us are actively involved in developing these new standards, and will continue to make them publicly available when they are ready.

So there are a lot of disclaimers in that text. Let me highlight three of them:

  • LFW is a 1:1 test, not a 1:N test. Therefore, while it can test how one face compares to another face, it cannot test how one face compares to a database of faces. The usual law enforcement use case is to compare a single face (for example, one captured from a video camera) against an entire database of known criminals. That’s a computationally different exercise from comparing a crime scene face against a single criminal face, then against a second criminal face, and so forth. (See the sketch after this list.)
  • The people in the LFW database are not necessarily representative of the world population, the population of the United States, the population of Massachusetts, or any population at all. So you can’t conclude that a master face that matches against a bunch of LFW faces would match against a bunch of faces from your locality.
  • Captured faces exhibit a variety of quality levels. A face image captured by a camera three feet from you at eye level in good lighting will differ from a face image captured by an overhead camera in poor lighting. LFW doesn’t have a lot of these latter images.
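
To illustrate the first bullet, here is a minimal sketch of the computational difference between 1:1 verification and 1:N identification. Again, this is my own simplification, reusing the same hypothetical embeddings and threshold as the earlier sketch; real systems use calibrated thresholds and far more sophisticated search and ranking.

```python
# A toy contrast between 1:1 verification (LFW-style pair matching)
# and 1:N identification (the law enforcement use case). Embeddings
# and threshold are hypothetical, as in the earlier sketch.
from typing import List, Optional
import numpy as np

THRESHOLD = 0.45  # hypothetical acceptance threshold

def verify(probe: np.ndarray, claimed: np.ndarray) -> bool:
    """1:1: is this face the single person it claims to be?"""
    return float(probe @ claimed) >= THRESHOLD

def identify(probe: np.ndarray, gallery: List[np.ndarray]) -> Optional[int]:
    """1:N: which gallery entry, if any, does this face belong to?

    One probe is scored against EVERY enrolled template, so the
    chance of a false match accumulates across the whole gallery
    instead of being confined to a single comparison.
    """
    scores = [float(probe @ e) for e in gallery]
    best = int(np.argmax(scores))
    return best if scores[best] >= THRESHOLD else None

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    gallery = [v / np.linalg.norm(v) for v in rng.normal(size=(5, 128))]
    probe = gallery[2]
    print(verify(probe, gallery[2]))  # True: the 1:1 claim checks out
    print(identify(probe, gallery))   # 2: found among all enrollees
```

Good performance at verification tells you little about identification: a threshold tuned for one pair at a time can generate far more false matches when scored against millions of enrolled templates, which is exactly why the LFW disclaimer warns against extrapolating from verification to 1:N recognition.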

I should mention one more thing about LFW. Its maintainers allow testers to access the database itself, essentially making LFW an “open book test.” And as any student knows, it’s much easier to get an A on an open book test.

Now let’s take a look at another test that was mentioned by the LFW folks themselves: namely, NIST’s Face Recognition Vendor Test (FRVT).

This is actually a series of tests that has evolved over the years; NIST is now conducting ongoing tests for both 1:1 and 1:N (unlike LFW, which only conducts 1:1 testing). This is important because most of the large-scale facial recognition commercial applications that we think about are 1:N applications (see my example above, in which a facial image captured at a crime scene is compared against an entire database of criminals).

In addition, NIST uses multiple datasets that cover a number of use cases, including mugshots, visa photos, and faces “in the wild” (i.e., not captured under ideal conditions).

It’s also important to note that NIST’s tests are likewise intended to benefit research; good performance in a NIST test does not necessarily mean that an algorithm will perform well in a commercial implementation. (If the algorithm is even available in a commercial implementation: some algorithms submitted to NIST are research-only and never made it into a production system.) For the difference between testing an algorithm in a NIST test and testing an algorithm in a production system, please see Mike French’s LinkedIn article on the topic. (I’ve cited this article before.)

With those caveats, I will note that NIST’s FRVT tests are NOT open book tests. Vendors and other entities submit their algorithms to NIST; NIST tests them and then tells YOU what the results were.

So perhaps FRVT is more robust than LFW, but it’s still a research project.

Let’s look at the algorithms

Now that we’ve looked at two test datasets, let’s look at the algorithms themselves and evaluate the claim that results for three algorithms (dlib, FaceNet, and SphereFace) can naturally be extrapolated to ALL facial recognition algorithms.

This isn’t the first time that we’ve seen such an attempt at extrapolation. After all, the MIT Media Lab’s Gender Shades study (which evaluated neither 1:1 nor 1:N use cases, but algorithmic attempts to identify gender and race) itself only used three algorithms. Yet the popular media conclusion from this study was that ALL facial recognition algorithms are racist.

Compare this with NIST’s subsequent study, which evaluated 189 algorithms specifically for 1:1 and 1:N use cases. While NIST did find race/sex differentials in some algorithms, these were not universal: “Tests showed a wide range in accuracy across developers, with the most accurate algorithms producing many fewer errors.”

In other words, just because an earlier test of three algorithms demonstrated issues in determining race or gender, that doesn’t mean that the current crop of hundreds of algorithms will necessarily demonstrate issues in identifying individuals.

So let’s circle back to the master faces study. How do the results of this study affect “current facial recognition products”?

The answer is “We don’t know.”

Has the master faces experiment been replicated against the leading commercial algorithms tested by Labeled Faces in the Wild? Apparently not.

Has the master faces experiment been replicated against the leading commercial algorithms tested by NIST? Well, let’s look at the various ways you can define the “leading” commercial algorithms.

For example, here’s the view of the test set that IDEMIA would want you to see: the 1:N test sorted by the “Visa Border” column (results as of August 6, 2021):

And here’s the view of the test set that Paravision would want you to see: the 1:1 test sorted by the “Mugshot” column (results as of August 6, 2021):

From https://pages.nist.gov/frvt/html/frvt11.html as of August 6, 2021.

Now you can play with the sort order in many different ways, but the question remains: have the Israeli researchers, or anyone else, performed a “master faces” test (preferably a 1:N test) on the IDEMIA, Paravision, SenseTime, NtechLab, AnyVision, or ANY other commercial algorithm?

Maybe a future study WILL conclude that even the leading commercial algorithms are vulnerable to master face attacks. However, until such studies are actually performed, we CANNOT conclude that commercial facial recognition algorithms are vulnerable to master face attacks.

So naturally journalists approach the results critically…not

But I’m sure that people are going to make those conclusions anyway.

From https://xkcd.com/386/. Attribution-NonCommercial 2.5 Generic (CC BY-NC 2.5).

Does anyone even UNDERSTAND these studies? (Or do they choose NOT to understand them?)

How can you avoid the same mistakes when communicating about biometrics?

As you can see, people often write about biometric topics without understanding them fully.

Even biometric companies sometimes have difficulty communicating about biometric topics in a way that laypeople can understand. (Perhaps that’s the reason why people misconstrue these studies and conclude that “all facial recognition is racist” and “any facial recognition system can be spoofed by a master face.”)

Are you about to publish something about biometrics that requires a sanity check? (Hopefully not literally, but you know what I mean.)

Well, why not turn to a biometric content marketing expert?

Bredemarket offers over 25 years of experience in biometrics that can be applied to your marketing and writing projects.

If you don’t have a content marketing project now, you can still subscribe to my Bredemarket Identity Firm Services LinkedIn page or my Bredemarket Identity Firm Services Facebook group to keep up with news about biometrics (or about other authentication factors; biometrics isn’t the only one). Or scroll down to the bottom of this blog post and subscribe to my Bredemarket blog.

If my content creation process can benefit your biometric (or other technology) marketing and writing projects, contact me.
