[FaVCI2D] Face Verification with Challenging Imposters and Diversified Demographics

2 minute read


With the recent General Data Protection Regulation (GDPR) directive in European Union (EU) regarding data protection, establishing a new person identification dataset is extremely laborious. There have been many questions raised by the research communities whether they can build large datasets such as the public MegaFace or the private DeepFace or FaceNet, owned by Facebook, and Google, respectively without any infringement of the aforementioned directive. In this context, during dataset design, we put focus on: (1) compliance with legal requirements, (2) data minimization by storing only information necessary to fulfill the task, and (3) reducing demographic imbalance to ensure a fair analysis of demographic segments.

The proposed dataset includes identities from 153 countries. The total number of unique IDs is 52,411, with 12,468 of them being used in genuine pairs. The total number of images is 64,879, with two images for IDs from genuine pairs and one for the imposter-only IDs. The complete versions of $FaVCI2D$, created with random and challenging imposter selection, include a total of 24,936 pairs divided equally between the two types of pairs.

We target a balanced gender and geographic distribution. It was possible to obtain enough pairs for America, Asia and Europe but not for Africa. The dataset includes 3,708 genuine pairs, 50% female - 50% male, for each of the first three regions and 1,344 for Africa, 23.3% female - 76.7% male. The gender distribution of IDs in the entire dataset is 44% female - 56% male, which is the closest we can get with reusable resources to a perfect balance.

Age-related information was found for 6,535 out of a total of 12,468 genuine pairs. The distribution of ages at the moment when the photo was taken is: 17% for 18-25 years old; 27% for 26 - 35; 18% for 36-45; 15% for 46-55; 12% for 56-65; 7% for 66-75 and 4% for 76 and over. The distribution of age difference (in years) between two images in genuine pairs is 18.5% for the same age, 29.5% for 1 and 2, 23% between 3 and 5; 17% between 6 and 10; and 12 for more than years. While relatively imbalanced the two age-related distributions, they include enough examples in each range to run an age-oriented analysis of verification results.

Please find all the details about the data set in the draft of the main paper here [https://arxiv.org/abs/2110.08667]

If you would like to request access to the data set, please visit the data set GitHub repository [link].

author    = {Popescu, Adrian and {\c{S}}tefan, Liviu-Daniel and Deshayes-Chossart, J{\'e}r{\^o}me and Ionescu, Bogdan},
title     = {Face Verification with Challenging Imposters and Diversified Demographics},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
month     = {January},
year      = {2022}