Alphabet Inc’s Google told Reuters this week it is developing an alternative to the industry standard method for classifying skin tones, which a growing chorus of technology researchers and dermatologists says is inadequate for assessing whether products are biased against people of color.
At issue is a six-color scale known as Fitzpatrick Skin Type (FST), which dermatologists have used since the 1970s. Tech companies now rely on it to categorize people and measure whether products such as facial recognition systems or smartwatch heart-rate sensors perform equally well across skin tones.
Critics say FST, which includes four categories for “white” skin and one apiece for “black” and “brown,” disregards diversity among people of color. Researchers at the US Department of Homeland Security, during a federal technology standards conference last October, recommended abandoning FST for evaluating facial recognition because it poorly represents color range in diverse populations.
In response to Reuters’ questions about FST, Google, for the first time and ahead of peers, said that it has been quietly pursuing better measures.
“We are working on alternative, more inclusive, measures that could be useful in the development of our products, and will collaborate with scientific and medical experts, as well as groups working with communities of color,” the company said, declining to offer details on the effort.
The controversy is part of a larger reckoning over racism and diversity in the tech industry, where the workforce is more white than in sectors like finance. Ensuring technology works well for all skin colors, as well as different ages and genders, is assuming greater importance as new products, often powered by artificial intelligence (AI), extend into sensitive and regulated areas such as health care and law enforcement.
Companies know their products can be faulty for groups that are under-represented in research and testing data. The concern over FST is that its limited scale for darker skin could lead to technology that, for instance, works for golden brown skin but fails for espresso red tones.
Numerous types of products offer palettes far richer than FST. Crayola last year launched 24 skin tone crayons, and Mattel Inc’s Barbie Fashionistas dolls this year cover nine tones.
The issue is far from academic for Google. When the company announced in February that cameras on some Android phones could measure pulse rates via a fingertip, it said readings on average would err by 1.8 percent regardless of whether users had light or dark skin.
The company later gave similar assurances that skin type would not noticeably affect the results of a feature for filtering backgrounds on Meet video conferences, nor of an upcoming web tool for identifying skin conditions, informally dubbed Derm Assist.
Those conclusions derived from testing with the six-tone FST.
The late Harvard University dermatologist Dr. Thomas Fitzpatrick invented the scale to personalize ultraviolet radiation treatment for psoriasis, an itchy skin condition. He grouped the skin of “white” people as Roman numerals I to IV by asking how much sunburn or tan they developed after certain periods in the sun.
A decade later came type V for “brown” skin and VI for “black.” The scale is still part of US regulations for testing sunblock products, and it remains a popular dermatology standard for assessing patients’ cancer risk and more.
Some dermatologists say the scale is a poor and overused measure for care, and often conflated with race and ethnicity.
“Many people would assume I am skin type V, which rarely to never burns, but I burn,” said Dr. Susan Taylor, a University of Pennsylvania dermatologist who founded Skin of Color Society in 2004 to promote research on marginalized communities. “To look at my skin hue and say I am type V does me disservice.”
Technology companies, until recently, were unconcerned. Unicode, an industry association overseeing emojis, referred to FST in 2014 as its basis for adopting five skin tones beyond yellow, saying the scale was “without negative associations.”
A 2018 study titled “Gender Shades,” which found facial analysis systems more often misgendered people with darker skin, popularized using FST for evaluating AI. The research described FST as a “starting point,” but scientists behind similar studies that came later told Reuters they used the scale to stay consistent.
“As a first measure for a relatively immature market, it serves its purpose to help us identify red flags,” said Inioluwa Deborah Raji, a Mozilla fellow focused on auditing AI.
In an April study testing AI for detecting deepfakes, Facebook Inc. researchers wrote FST “clearly does not encompass the diversity within brown and black skin tones.” Still, they released videos of 3,000 individuals to be used for evaluating AI systems, with FST tags attached based on the assessments of eight human raters.
The judgment of the raters is central. Facial recognition software startup AnyVision last year gave celebrity examples to raters: former baseball great Derek Jeter as a type IV, model Tyra Banks a V and rapper 50 Cent a VI.
AnyVision told Reuters it agreed with Google’s decision to revisit use of FST, and Facebook said it is open to better measures.
Microsoft Corp. and smartwatch makers Apple Inc. and Garmin Ltd. reference FST when working on health-related sensors.
But use of FST could be fueling “false assurances” about heart rate readings from smartwatches on darker skin, University of California San Diego clinicians, inspired by the Black Lives Matter social equality movement, wrote in the journal Sleep last year.
Microsoft acknowledged FST’s imperfections. Apple said it tests on humans across skin tones using various measures, with FST only sometimes among them. Garmin said it believes its readings are reliable because of wide-ranging testing.
Victor Casale, who founded makeup company Mob Beauty and helped Crayola on the new crayons, said he developed 40 shades for foundation, each different from the next by about 3 percent, or enough for most adults to distinguish.
Color accuracy on electronics suggests tech standards should have 12 to 18 tones, he said, adding, “you can’t just have six.”