Are AI devices evaluated appropriately?

Discerning good algorithms from bad ones: A study from Stanford University found limitations in the Food and Drug Administration’s approval process.

In just the last two years, artificial intelligence has become embedded in scores of medical devices that offer advice to ER doctors, cardiologists, oncologists, and countless other health care providers. The FDA has approved at least 130 AI-powered medical devices, half of them in the last year alone, and the numbers are certain to surge far higher in the next few years.

Several AI devices aim at spotting and alerting doctors to suspected blood clots in the lungs. Some analyze mammograms and ultrasound images for signs of breast cancer, while others examine brain scans for signs of hemorrhage. Cardiac AI devices can now flag a wide range of hidden heart problems.

But how much do either regulators or doctors really know about the accuracy of these tools? A new study led by researchers at Stanford, some of whom are themselves developing devices, suggests that the evidence isn’t as comprehensive as it should be and may miss some of the peculiar challenges posed by artificial intelligence.

Many devices were tested solely on historical — and potentially outdated — patient data. Few were tested in actual clinical settings, in which doctors were comparing their own assessments with the AI-generated recommendations. And many devices were tested at only one or two sites, which can limit the racial and demographic diversity of patients and create unintended biases.

“Quite surprisingly, a lot of the AI algorithms weren’t evaluated very thoroughly,” says James Zou, the study’s co-author, who is an assistant professor of biomedical data science at Stanford University as well as a faculty member of the Stanford Institute for Human-Centered Artificial Intelligence (HAI).

In the study, the Stanford researchers analyzed the evidence submitted for every AI medical device that the FDA approved from 2015 through 2020.

Testing challenges, biased data

In sharp contrast to the extensive clinical trials required for new pharmaceuticals, the researchers found, most of the AI-based medical devices were tested against “retrospective” data — meaning that their predictions and recommendations weren’t tested on how well they assessed live patients in real situations but rather on how they might have performed if they had been used in historical cases.

One big problem with that approach, says Zou, is that it fails to capture how health care providers use the AI information in actual clinical practice. Predictive algorithms are primarily intended to be a tool to assist doctors — and not to substitute for their judgment. But their effectiveness depends heavily on the ways in which doctors actually use them.

The researchers also found that many of the new AI devices were tested in only one or two geographic locations, which can severely limit how well they work in different demographic groups. “It’s a well-known challenge for artificial intelligence that an algorithm may work well for one population group and not for another,” says Zou.

Revealing significant discrepancies

The researchers offered concrete evidence of that risk by conducting a case study of a deep learning model that analyzes chest X-rays for signs of collapsed lungs. The system was trained and tested on patient data from Stanford Health Center, but Zou and his colleagues also tested it against patient data from two other sites — the National Institutes of Health in Bethesda, Md., and Beth Israel Deaconess Medical Center in Boston. Sure enough, the algorithm was almost 10 percent less accurate at the other sites. In Boston, moreover, the researchers found that its accuracy was higher for white patients than for Black patients.
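The kind of external audit the researchers performed can be sketched in a few lines: score the same model's predictions separately for each site and compare. The data below is entirely synthetic — the site names, cohort sizes, and error rates are illustrative stand-ins, not figures from the study.

```python
import random

random.seed(0)

def accuracy(records):
    """Fraction of records where the model's prediction matches the true label."""
    return sum(r["pred"] == r["label"] for r in records) / len(records)

def make_cohort(n, error_rate, site):
    """Synthetic reads: the model errs with probability `error_rate`."""
    recs = []
    for _ in range(n):
        label = random.choice([0, 1])
        pred = label if random.random() > error_rate else 1 - label
        recs.append({"site": site, "label": label, "pred": pred})
    return recs

# Illustrative only: the external sites are made roughly 10 points worse,
# mirroring the size of the drop the study reports.
records = (
    make_cohort(500, 0.10, "training_site")
    + make_cohort(500, 0.20, "external_site_a")
    + make_cohort(500, 0.20, "external_site_b")
)

for site in ("training_site", "external_site_a", "external_site_b"):
    subset = [r for r in records if r["site"] == site]
    print(f"{site}: accuracy = {accuracy(subset):.2f} (n = {len(subset)})")
```

The same breakdown can be repeated over demographic subgroups instead of sites, which is how a gap like the one seen in the Boston cohort would surface.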

AI systems have been famously vulnerable to built-in racial and gender biases, Zou notes. Facial- and voice-recognition systems, for example, have been found to be much more accurate for white people than people of color. Those biases can actually become worse if they aren’t identified and corrected.

Zou says AI poses other novel challenges that don’t come up with conventional medical devices. For one thing, the datasets on which AI algorithms are trained can easily become outdated. The health characteristics of Americans may be quite different after the COVID-19 pandemic, for example.

Perhaps more startling, AI systems often evolve on their own as they incorporate additional experience into their algorithms. “The biggest difference between AI and traditional medical devices is that these are learning algorithms, and they keep learning,” Zou says. “They’re also prone to biases. If we don’t rigorously monitor these devices, the biases could get worse. The patient population could also evolve.”
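Zou's point about rigorous monitoring can be made concrete with a minimal sketch: track a rolling accuracy after deployment and flag the device for human review when it drifts too far below the level measured at approval. The class, window size, and tolerance below are hypothetical choices for illustration, not anything from the study or from FDA guidance.

```python
from collections import deque

class PerformanceMonitor:
    """Track rolling accuracy of a deployed model and flag it for review
    when performance drops below the level measured at approval."""

    def __init__(self, baseline_accuracy, window=200, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, prediction, label):
        self.outcomes.append(1 if prediction == label else 0)

    def rolling_accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        acc = self.rolling_accuracy()
        return acc is not None and acc < self.baseline - self.tolerance

monitor = PerformanceMonitor(baseline_accuracy=0.90)

# Simulate a post-deployment stream: 150 correct reads followed by
# 50 incorrect ones, as if the patient population had shifted.
for _ in range(150):
    monitor.record(prediction=1, label=1)   # correct
for _ in range(50):
    monitor.record(prediction=0, label=1)   # incorrect

print(f"rolling accuracy: {monitor.rolling_accuracy():.2f}")  # 0.75
print(f"flag for review: {monitor.needs_review()}")           # True
```

Tracking the rolling accuracy separately per demographic subgroup, rather than overall, would also catch the case where biases widen even while aggregate performance holds steady.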

“We’re extremely excited about the overall promise of AI in medicine,” Zou adds. Indeed, his research group is developing AI medical algorithms of its own. “We don’t want things to be over-regulated. At the same time, we want to make sure there is rigorous evaluation especially for high-risk medical applications. You want to make sure the drugs you are taking are thoroughly vetted. It’s the same thing here.”

The research was published in Nature Medicine.
