
How AI Bias Impacts Medical Diagnosis

AI models that are good at predicting race/gender are less accurate in diagnosis.

Key points

  • AI medical imaging models are biased, performing better at predicting race/gender than diagnosing diseases.
  • AI models can learn biases from the data they are trained on or biases of the humans who create them.
  • Regulatory bodies should consider requiring developers to monitor the real-world performance of AI models.
Source: flutie8211/Pixabay

Just like humans, artificial intelligence (AI) machine learning models are prone to bias. Understanding the nature of AI bias is paramount for critical applications that may impact life-or-death decisions, such as in the medical and healthcare industries. New peer-reviewed research published in Nature Medicine not only shows that AI medical imaging models that excel at predicting race and gender do not perform as well at diagnosing disease but also provides best practices to address this disparity.

“Our findings underscore the necessity for regular evaluation of model performance under distribution shift, challenging the popular opinion of a single fair model across different settings,” wrote senior author Marzyeh Ghassemi, Ph.D., an associate professor of electrical engineering and computer science at the Massachusetts Institute of Technology (MIT), in collaboration with Dina Katabi, Ph.D., a professor of computer science and electrical engineering at MIT, Yuzhe Yang, an MIT CSAIL graduate student, Haoran Zhang, an MIT graduate student, and Judy Gichoya, Ph.D., an associate professor of radiology at the Emory University School of Medicine.

The use of AI machine learning in medical imaging is growing. The market size of AI in medical imaging is expected to reach USD 8.18 billion by 2030 globally, with a compound annual growth rate of 34.8% during 2023-2030, according to an industry report by Grand View Research. The use of AI in neurology was the largest segment, with a 38.3% share, and the North American region had a 44% revenue share in 2022, per Grand View Research. Examples of companies in the AI medical imaging space include IBM, GE Healthcare, Aidoc, Arterys, Enlitic, iCAD Inc., Caption Health, Gleamer, Fujifilm Holdings Corporation, Butterfly Network, AZmed, Siemens Healthineers, Koninklijke Philips, Agfa-Gevaert Group/Agfa Health Care, Imagia Cybernetics Inc., Lunit, ContextVision AB, Blackford Analysis, and others.

The rise of AI in medical imaging makes it essential to maximize accuracy and minimize bias. The term bias encompasses partiality, tendency, preference, and systematically flawed thinking patterns. In people, biases can be conscious or unconscious. There are numerous human biases; examples include stereotyping, the bandwagon effect, the placebo effect, confirmation bias, the halo effect, optimism bias, hindsight bias, anchoring bias, the availability heuristic, survivorship bias, familiarity bias, gender bias, the gambler’s fallacy, group attribution error, self-attribution bias, and many more.

AI bias impacts overall performance accuracy. In AI machine learning, algorithms learn from massive amounts of training data rather than from explicitly hard-coded instructions. Several factors affect how resilient AI models are to bias: not only the quantity of training data but also its quality, which is shaped by the objectivity of the data, the structure of the data itself, data collection practices, and data sources.

Furthermore, AI models may be vulnerable to the inherent cognitive biases of the humans in charge of creating the algorithm, assigning the weights to data points, and deciding which indicators to include or exclude. For example, in February 2024, Google paused image generation in its AI chatbot Gemini (formerly Bard) after user complaints that it was generating historically inaccurate images. Namely, Gemini tended to favor generating images of non-white people, and historical figures were often depicted with the wrong race and/or gender. Case in point: before the pause, Gemini mistakenly depicted George Washington and Vikings as Black men. In a company blog post, Google attributed the problem to the fine-tuning of its image generator, Imagen 2, which ultimately led the AI model to go haywire.

“We confirm that medical imaging AI leverages demographic shortcuts in disease classification,” the researchers at MIT reported.

This was not surprising: two years prior, Ghassemi, Gichoya, and Zhang were among the co-authors of a separate MIT and Harvard Medical School study published in The Lancet Digital Health showing that AI deep learning models can predict a person’s self-reported race from medical image pixel data with a high degree of accuracy. The 2022 study demonstrated that AI easily learned to spot self-reported racial identity from medical images. However, humans do not understand how the AI achieves this, underscoring the need to mitigate risk with regular audits and evaluations of medical AI.

For the current study, the scientists aimed to determine if AI disease classification models use demographic information as a heuristic and if these shortcuts cause biased predictions. The AI models were trained to predict if a patient had one of three medical conditions: a collapsed lung, enlargement of the heart, or fluid build-up in the lungs.
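
To make this concrete, here is a minimal sketch in Python of what such a multi-label chest X-ray classifier could look like. The backbone network, loss function, and hyperparameters below are illustrative assumptions, not the architecture or code the researchers actually used.

```python
import torch
import torch.nn as nn
from torchvision import models

# The article's three findings: collapsed lung, enlarged heart, fluid build-up in the lungs.
NUM_FINDINGS = 3

# Backbone and hyperparameters are assumptions for illustration, not the study's setup.
model = models.resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_FINDINGS)

criterion = nn.BCEWithLogitsLoss()  # multi-label: each finding is an independent yes/no
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One gradient step on a batch of (N, 3, H, W) images and (N, 3) binary labels."""
    optimizer.zero_grad()
    logits = model(images)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

Notice that nothing in a pipeline like this tells the model to use race or gender; any demographic shortcut it picks up comes from patterns in the training images and labels.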

The AI models were trained on chest X-rays from MIMIC-CXR, a large publicly available dataset of chest radiographs from Beth Israel Deaconess Medical Center in Boston, and then evaluated on a composite out-of-distribution dataset consisting of data from CheXpert, NIH, SIIM, PadChest, and VinDr. In data science, out-of-distribution (OOD) data is new data that the AI model was not trained on and is therefore considered “unseen,” as opposed to in-distribution (ID) data that the model has “seen” during training. Overall, the researchers used more than 854,000 chest X-rays from globally sourced radiology datasets spanning three continents, 6,800 ophthalmology images, and over 32,000 dermatology images.
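
To illustrate the ID/OOD distinction, the sketch below scores one model's predictions on an in-distribution test set and on external, out-of-distribution test sets using the area under the ROC curve (AUROC). The dataset names echo those in the study, but the scores and labels here are random placeholders for demonstration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Placeholder predictions and labels; in practice these come from the trained model
# and the radiologist-confirmed findings for each test set.
test_sets = {
    "MIMIC-CXR (ID)": (rng.random(1000), rng.integers(0, 2, 1000)),
    "CheXpert (OOD)": (rng.random(1000), rng.integers(0, 2, 1000)),
    "PadChest (OOD)": (rng.random(1000), rng.integers(0, 2, 1000)),
}

for name, (scores, labels) in test_sets.items():
    print(f"{name}: AUROC = {roc_auc_score(labels, scores):.3f}")
```

A drop in AUROC from the ID row to the OOD rows is the kind of distribution-shift degradation the researchers warn about.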

The AI models performed well overall but exhibited disparities in diagnostic accuracy across gender and race. The models that were best at predicting demographics also showed the greatest disparities in accuracy when diagnosing images of patients of different genders or races.

Next, the team investigated how effectively state-of-the-art techniques can mitigate these shortcuts and create less biased AI models. They found that these methods could reduce the disparities. Still, the methods were most effective when the model was evaluated on the same type of patients it was originally trained on, or in other words, on in-distribution data.
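
The paper evaluates several state-of-the-art debiasing methods. As one illustrative example of the general idea, and not necessarily a technique the authors tested, the sketch below reweights the training loss so that each demographic group contributes equally, which discourages the model from trading accuracy on a small group for accuracy on a large one.

```python
import torch
import torch.nn as nn

# Keep per-sample losses so they can be re-aggregated by demographic group.
criterion = nn.BCEWithLogitsLoss(reduction="none")

def group_balanced_loss(logits, labels, group_ids):
    """Average losses within each demographic group, then across groups."""
    per_sample = criterion(logits, labels.float()).mean(dim=1)  # shape (N,)
    group_means = [per_sample[group_ids == g].mean() for g in torch.unique(group_ids)]
    return torch.stack(group_means).mean()
```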

In a real-world clinical setting, hospitals often deploy AI models that were trained on data from other institutions, which makes the local patient data out-of-distribution. As a best practice, the research suggests that hospitals using externally developed AI models carefully assess the algorithms on their own patient data to understand performance accuracy across demographic groups such as race and gender.
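
A minimal version of such a local audit might look like the following sketch: score the externally developed model on the hospital's own patients, then compare AUROC across demographic subgroups. The column names and the fairness-gap threshold are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def audit_by_group(df, group_col, score_col="model_score", label_col="label"):
    """Return AUROC per subgroup and the gap between the best- and worst-served groups."""
    aucs = {
        group: roc_auc_score(sub[label_col], sub[score_col])
        for group, sub in df.groupby(group_col)
        if sub[label_col].nunique() == 2  # AUROC needs both positive and negative cases
    }
    gap = max(aucs.values()) - min(aucs.values())
    return aucs, gap

# Usage: local_df holds one row per local patient with the model's score, the true
# label, and demographic fields such as "race" and "gender" (hypothetical names).
# aucs, gap = audit_by_group(local_df, group_col="race")
# if gap > 0.05:  # illustrative threshold
#     print("Fairness degradation detected; recalibrate or retrain before deployment.")
```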

“This questions the effectiveness of developer assurances on model fairness at the time of testing and highlights the need for regulatory bodies to consider real-world performance monitoring, including fairness degradation,” concluded the researchers.

Copyright © 2024 Cami Rosso. All rights reserved.
