(Part of the biometric product marketing expert series)
Normal people look forward to the latest album or movie. A biometric product marketing expert instead looks forward to an inaugural test report from the National Institute of Standards and Technology (NIST) on age estimation and verification using faces.
Waiting
I’ve been waiting for this report for months now (since I initially mentioned it in July 2023), and in April NIST announced it would be available in the next few weeks.
NIST news release
Yesterday I learned of the report’s public availability via a NIST news release.
A new study from the National Institute of Standards and Technology (NIST) evaluates the performance of software that estimates a person’s age based on the physical characteristics evident in a photo of their face. Such age estimation and verification (AEV) software might be used as a gatekeeper for activities that have an age restriction, such as purchasing alcohol or accessing mature content online….
The new study is NIST’s first foray into AEV evaluation in a decade and kicks off a new, long-term effort by the agency to perform frequent, regular tests of the technology. NIST last evaluated AEV software in 2014….
(The new test) asked the algorithms to specify whether the person in the photo was over the age of 21.
Well, sort of. We’ll get to that later.
Current AEV results
I was in the middle of a client project on Thursday and didn’t have time to read the detailed report, but I did have a second to look at the current results. As with its other ongoing tests, NIST will update the age estimation and verification (AEV) results as these six vendors (and others) submit new algorithms.
This post looks at my three favorite questions:
- Why NIST tests age estimation (and everything else it tests).
- How NIST tests age estimation.
- What the NIST IR 8525 test says, and what it means.
Why NIST tests age estimation
Why does NIST test age estimation, or anything else?
The Information Technology Laboratory and its Information Access Division
One of NIST’s six research laboratories is its Information Technology Laboratory (ITL), charged “to cultivate trust in information technology (IT) and metrology.” Since NIST is part of the U.S. Department of Commerce, Americans (and others) who rely on information technology need an unbiased source on the accuracy and validity of this technology. NIST cultivates trust through a myriad of independent tests.
Some of those tests are carried out by one of ITL’s six divisions, the Information Access Division (IAD). This division focuses on “human action, behavior, characteristics and communication.”
The difference between FRTE and FATE
While there is a lot of IAD “characteristics” work that excites biometric folks, including ANSI/NIST standard work, contactless fingerprint capture, the Fingerprint Vendor Technology Evaluation (ugh), and other topics, we’re going to focus on our new favorite acronyms, FRTE (Face Recognition Technology Evaluation) and FATE (Face Analysis Technology Evaluation). If these acronyms are new to you, I talked about them last August (and the deprecation of the old FRVT acronym).
Basically, the difference between “recognition” and “analysis” in this context is that recognition identifies an individual, while analysis identifies a characteristic of an individual. So the infamous “Gender Shades” study, which tested the performance of three algorithms in identifying people’s sex and race, is an example of analysis.
Age analysis
The age of a person is another example of analysis. In and of itself an age cannot identify an individual, since around 385,000 people are born every day. Even with lower birth rates when YOU were born, there are tens or hundreds of thousands of people who share your birthday.
And your age matters in the situations I mentioned above. Even when marijuana is legal in your state, you can’t sell it to a four-year-old. And that four-year-old can’t (or shouldn’t) sign up for Facebook either.
You can check a person’s ID, but that takes time and only works when a person has an ID. The only IDs that a four-year-old has are their passport (for the few who have one) and their birth certificate (which is non-standard from county to county and thus difficult to verify). And not even all adults have IDs, especially in third world countries.
Self-testing
So companies like Yoti developed age estimation solutions that didn’t rely on government-issued identity documents. The companies tested their performance and accuracy themselves (see the PDF of Yoti’s March 2023 white paper here). However, there are two drawbacks to this:
- While I am certain that Yoti wouldn’t pull any shenanigans, results from a self-test always engender doubt. Is the tester truly honest about its testing? Does it (intentionally or unintentionally) gloss over things that should be tested? After all, the purpose of a white paper is for a vendor to present facts that lead a prospect to buy a vendor’s solution.
- Even with its self-tests, Yoti did not have the ability (or the legal permission) to test the accuracy of its age estimation competitors.
How NIST tests age estimation
Enter NIST, where the scientists took a break from metrological testing or whatever to conduct an independent test. NIST asked vendors to participate in a test in which NIST personnel would run the test on NIST’s computers, using NIST’s data. This prevented the vendors from skewing the results; they handed their algorithms to NIST and waited several months for NIST to tell them how they did.
I won’t go into it here, but it’s worth noting that a NIST test is just a test, and test results may not be the same when you implement a vendor’s age estimation solution on CUSTOMER computers with CUSTOMER data.
The NIST internal report I awaited
NOW let’s turn to the actual report, NIST IR 8525 “Face Analysis Technology Evaluation: Age Estimation and Verification.”
NIST needed a set of common data to test the vendor algorithms, so it used “around eleven million photos drawn from four operational repositories: immigration visas, arrest mugshots, border crossings, and immigration office photos.” (These were provided by the U.S. Departments of Homeland Security and Justice.) All of these photos include the actual ages of the persons (although mugshots only include the year of birth, not the date of birth), and some include sex and country-of-birth information.
For each algorithm and each dataset, NIST recorded the mean absolute error (MAE), which is the mean number of years between the algorithm’s estimated age and the actual age. NIST also recorded other error measurements and, for certain tests (such as a test of whether or not a person is 17 years old), the false positive rate (FPR).
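To make the MAE metric concrete, here’s a minimal sketch of how it could be computed from per-photo estimates. The function name and the five sample ages are illustrative, not NIST’s data or code:

```python
def mean_absolute_error(estimated_ages, actual_ages):
    """Mean number of years between estimated and actual ages."""
    errors = [abs(est - act) for est, act in zip(estimated_ages, actual_ages)]
    return sum(errors) / len(errors)

# Hypothetical estimates for five photos
estimated = [24, 32, 19, 45, 60]
actual    = [22, 33, 18, 47, 58]
print(mean_absolute_error(estimated, actual))  # 1.6
```

An MAE of 1.6 would mean the algorithm misses the true age by about a year and a half on average, which is why the Challenge-T buffer discussed below exists.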
The challenge with the methodology
Many of the tests used a “Challenge-T” policy, such as “Challenge 25.” In other words, the test doesn’t estimate whether a person IS a particular age, but whether a person is WELL ABOVE a particular age. Here’s how NIST describes it:
For restricted-age applications such as alcohol purchase, a Challenge-T policy accepts people with age estimated at or above T but requires additional age assurance checks on anyone assessed to have age below T.
So if you have to be 21 to access a good or service, the algorithm doesn’t estimate if you are over 21. Instead, it estimates whether you are over 25. If the algorithm thinks you’re over 25, you’re good to go. If it thinks you’re 24, pull out your ID card.
And if you want to be more accurate, raise the challenge age from 25 to 28.
NIST admits that this procedure results in a “tradeoff between protecting young people and inconveniencing older subjects” (where “older” is someone who is above the legal age but below the challenge age).
NIST also performed a variety of demographic tests that I won’t go into here.
What the NIST age estimation test says
OK, forget about all that. Let’s dig into the results.
Which algorithm is the best for age estimation?
It depends.
I’ve covered this before with regard to facial recognition. Because NIST conducts so many different tests, a vendor can turn to any single test in which it placed first and declare it is the best vendor.
So depending upon the test, the best age estimation vendor (based upon accuracy and/or resource usage) may be Dermalog, or Incode, or ROC (formerly Rank One Computing), or Unissey, or Yoti. Just look for that “(1)” superscript.
You read that right. Out of the 6 vendors, 5 are the best. And if you massage the data enough you can probably argue that Neurotechnology is the best also.
So if I were writing for one of these vendors, I’d argue that the vendor placed first in Subtest X, Subtest X is obviously the most important one in the entire test, and all the other ones are meaningless.
But the truth is what NIST said in its news release: there is no single standout algorithm. Different algorithms perform better based upon the sex or national origin of the people. Again, you can read the report for detailed results here.
What the report didn’t measure
NIST always clarifies what it did and didn’t test. In addition to the aforementioned caveat that this was a test environment that will differ from your operational environment, NIST provided some other comments.
The report excludes performance measured in interactive sessions, in which a person can cooperatively present and re-present to a camera. It does not measure accuracy effects related to disguises, cosmetics, or other presentation attacks. It does not address policy nor recommend AV thresholds as these differ across applications and jurisdictions.
Of course NIST is just starting this study, and could address some of these things in later studies. For example, its ongoing facial recognition accuracy tests never looked at the use case of people wearing masks until after COVID arrived and that test suddenly became important.
What about 22 year olds?
As noted above, the test used a Challenge 25 or Challenge 28 model which measured whether a person who needed to be 21 appeared to be 25 or 28 years old. This makes sense when current age estimation technology measures MAE in years, not days. NIST calculated the “inconvenience” to 21-25 (or 28) year olds affected by this method.
What about 13 year olds?
While a lot of attention is paid to the use cases for 21 year olds (buying booze) and 18 year olds (viewing porn), states and localities have also paid a lot of attention to the use cases for 13 year olds (signing up for social media). In fact, some legislators are less concerned about a 20 year old buying a beer than a 12 year old receiving text messages from a Meta user.
NIST tests for these in the “child online safety” tests, particularly these two:
- Age < 13 – False Positive Rates (FPR) are proportions of subjects aged below 13 but whose age is estimated from 13 to 16 (below 17).
- Age ≥ 17 – False Positive Rates (FPR) are proportions of subjects aged 17 or older but whose age is estimated from 13 to 16.
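The two FPR definitions above can be expressed directly in code. This is a hypothetical sketch with made-up data; the function names and the (actual, estimated) pairs are mine, not NIST’s:

```python
def fpr_under_13(actual_ages, estimated_ages):
    """Proportion of subjects actually under 13 whose estimated
    age falls in the 13-16 range (i.e., at least 13 but below 17)."""
    under_13 = [(a, e) for a, e in zip(actual_ages, estimated_ages) if a < 13]
    misses = [1 for a, e in under_13 if 13 <= e < 17]
    return len(misses) / len(under_13)

def fpr_17_and_over(actual_ages, estimated_ages):
    """Proportion of subjects aged 17 or older whose estimated
    age falls in the 13-16 range."""
    over_17 = [(a, e) for a, e in zip(actual_ages, estimated_ages) if a >= 17]
    misses = [1 for a, e in over_17 if 13 <= e < 17]
    return len(misses) / len(over_17)

# Hypothetical (actual, estimated) ages for six subjects
actual    = [10, 12, 11, 18, 20, 25]
estimated = [14, 11, 15, 16, 21, 26]
print(fpr_under_13(actual, estimated))     # 2 of 3 under-13s estimated as 13-16
print(fpr_17_and_over(actual, estimated))  # 1 of 3 adults estimated as 13-16
```

The first error lets a child onto the platform; the second locks an adult out. Both rates matter, which is why NIST reports them separately.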
However, the visa database is the only one that includes data of individuals with actual ages below age 13. The youngest ages in the other datasets are 14, or 18, or even 21, rendering them useless for the child online safety tests.
Why NIST researchers are great researchers
The mark of a great researcher is their ability to continue to get funding for their research, which is why so many scientific papers conclude with the statement “further study is needed.”
Here’s how NIST stated it:
Future work: The FATE AEV evaluation remains open, so we will continue to evaluate and report on newly submitted prototypes. In future reports we will: evaluate performance of implementations that can exploit having a prior known-age reference photo of a subject (see our API); consider whether video clips afford improved accuracy over still photographs; and extend demographic and quality analyses.
Translation: if Congress doesn’t continue to give NIST money, then high school students will get drunk or high, young teens will view porn, and kids will encounter fraudsters on Facebook. It’s up to you, Congress.