Vision Transformer (ViT) Models and Presentation Attack Detection

I tend to view presentation attack detection (PAD) through the lens of iBeta or, occasionally, BixeLab. But I need to remind myself that these are not the only entities examining PAD.

A recent paper authored by Koushik Srivatsan, Muzammal Naseer, and Karthik Nandakumar of the Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI) addresses PAD from a research perspective. I honestly don’t understand the research, but perhaps you do.

Flip spoofing his natural appearance by portraying Geraldine. Some were unable to detect the attack. By NBC Television – eBay item (photo front, photo back), Public Domain, https://commons.wikimedia.org/w/index.php?curid=16476809

Here is the abstract from “FLIP: Cross-domain Face Anti-spoofing with Language Guidance.”

Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to boost feature generalization further and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms the state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of “adaptive ViTs”.

From https://koushiksrivats.github.io/FLIP/?utm_source=tldrai
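To make the abstract’s “ensemble of class descriptions” idea a little more concrete to myself, here is a rough Python sketch of that one piece: several natural-language descriptions per class are embedded, averaged into a prototype, and an image is scored by cosine similarity to each prototype. The encoders below are random stand-ins for CLIP’s image and text towers, and the prompts are my own wording, not the paper’s.

```python
# Schematic sketch of classifying with an ensemble of class descriptions.
# The "encoders" are random stand-ins, NOT real CLIP models, and the
# prompts are illustrative, not the ones used in the FLIP paper.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512

def encode_image(image_path):     # stand-in for a CLIP/ViT image encoder
    return torch.randn(embed_dim)

def encode_text(prompt):          # stand-in for a CLIP text encoder
    return torch.randn(embed_dim)

class_descriptions = {
    "live":  ["a photo of a real human face",
              "a genuine face captured by the camera"],
    "spoof": ["a printed photo of a face",
              "a face replayed on a screen"],
}

# One prototype per class: the normalized mean of its description embeddings.
prototypes = {
    label: F.normalize(torch.stack([encode_text(p) for p in prompts]).mean(0), dim=0)
    for label, prompts in class_descriptions.items()
}

# Score a probe image against each class prototype by cosine similarity.
image_feat = F.normalize(encode_image("probe.jpg"), dim=0)   # placeholder path
scores = {label: torch.dot(image_feat, proto).item()
          for label, proto in prototypes.items()}
print(max(scores, key=scores.get), scores)
```

With real CLIP encoders in place of the random stand-ins, the same structure gives a zero-shot live/spoof decision without any task-specific classifier head, which (as I read it) is the low-data advantage the abstract is pointing at.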

FLIP, by the way, stands for “Face Anti-Spoofing with Language-Image Pretraining.” CLIP is “contrastive language-image pre-training.”

While I knew I couldn’t master this, I did want to know what LIP and ViT were.

However, I couldn’t find anything that discussed LIP on its own: every source I found talked about FLIP, CLIP, PLIP, GLIP, and so on. So I gave up and looked at Matthew Brems’ easy-to-read explainer on CLIP:

CLIP is the first multimodal (in this case, vision and text) model tackling computer vision and was recently released by OpenAI on January 5, 2021….CLIP is a bridge between computer vision and natural language processing.

From https://www.kdnuggets.com/2021/03/beginners-guide-clip-model.html
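Here is roughly what that “bridge” looks like in practice. This sketch assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint; the image file name and the two prompts are placeholders I made up, not anything from Brems or the FLIP paper.

```python
# Minimal CLIP usage: embed an image and some text prompts into the same
# space, then see which prompt the image is most similar to.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("face.jpg")                      # placeholder: any local image
prompts = ["a photo of a real human face",
           "a photo of a printed or replayed face"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(prompts, probs[0].tolist())))
```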

Sadly, Brems didn’t address ViT, so I turned to Chinmay Bhalerao.

Vision Transformers work by first dividing the image into a sequence of patches. Each patch is then represented as a vector. The vectors for each patch are then fed into a Transformer encoder. The Transformer encoder is a stack of self-attention layers. Self-attention is a mechanism that allows the model to learn long-range dependencies between the patches. This is important for image classification, as it allows the model to learn how the different parts of an image contribute to its overall label.

The output of the Transformer encoder is a sequence of vectors. These vectors represent the features of the image. The features are then used to classify the image.

From https://medium.com/data-and-beyond/vision-transformers-vit-a-very-basic-introduction-6cd29a7e56f3
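Bhalerao’s description maps fairly directly onto code. Here is a toy PyTorch sketch of that pipeline (patches, self-attention encoder, classification head); the patch size, embedding width, and layer counts are illustrative choices of mine, not the values of any real ViT or of FLIP.

```python
# Toy ViT: split the image into patches, encode them with self-attention,
# and classify from the pooled patch features.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_classes=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # 1. Divide the image into patches and project each patch to a vector.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        # 2. A stack of self-attention layers (the Transformer encoder).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # 3. Use the encoded features to classify the image (e.g., live vs. spoof).
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                   # images: (B, 3, 224, 224)
        x = self.to_patches(images)              # (B, dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)         # (B, 196, dim) patch vectors
        x = self.encoder(x + self.pos_embed)     # long-range dependencies via attention
        return self.head(x.mean(dim=1))          # pooled features -> class logits

logits = TinyViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 2])
```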

So Srivatsan et al. combined tiny little bits of images (patches) with language representations to determine which images are (using my words) “fake fake fake.”

From https://www.youtube.com/shorts/7B9EiNHohHE

Because a bot can’t always recognize a mannequin.

Or perhaps the bot and the mannequin are in on the shenanigans together.

The devil made them do it.
