Artificial intelligence is being used more widely in healthcare, but a new study suggests it may not be as reliable as it appears, especially in how it “thinks” through a case.
Researchers from Mass General Brigham tested 21 large AI models, including ChatGPT, Gemini, Claude, Grok and DeepSeek, by asking them to act like doctors across 29 clinical cases.
The findings, published in JAMA Network Open, show a clear gap. When given complete patient information, the models arrived at the correct final diagnosis more than 90% of the time.
But when researchers looked at the step-by-step reasoning, such as listing possible causes and deciding what tests to run, the models performed poorly. In fact, they failed to produce an appropriate list of possible diagnoses more than 80% of the time.
This means that while AI may land on the right answer, it often struggles with the process doctors rely on, a key part of safe medical care.
"These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn't much information," said Arya Rao, lead author and researcher.
To understand this better, researchers used a new scoring system that assessed different stages of diagnosis, from early thinking to treatment decisions.
Scores ranged from 64% for Gemini 1.5 Flash to 78% for newer models like Grok 4 and GPT-5, showing uneven performance.
In simple terms, researchers found that while the latest tools could give the correct final diagnosis, they struggled with the earlier steps of thinking through a case.
They had difficulty listing possible causes and dealing with uncertainty.
The study also found that AI improved when given more data, such as lab results and scans. But in real life, doctors rarely have all the information upfront.
Researchers warn that this gap could make AI misleading in many real-world situations, particularly in early diagnosis when uncertainty is highest.
For example, models like GPT-4o, Claude, and DeepSeek were clearly more accurate when asked for a final diagnosis than when asked what tests to run or what conditions to consider first.
Among all tasks, listing possible causes (an early and critical step) was where AI performed the worst.
Tasks like deciding on treatment or handling mixed clinical questions fell somewhere in between: not the best, but not the worst either.
The study also looked at whether AI performs better when given images, such as X-rays, CT scans, or heart readings, instead of just text.
Some models, including GPT-4.5, Claude, Gemini, and Grok, did slightly better when images were included.
But this was not true for all models. Overall, performance with images was inconsistent, while text-based questions gave more stable results.
“AI is not ready to replace doctors,” said Dr Marc Succi. “It can support them, but only with close oversight.”
AI can assist, but human judgment remains essential, especially when it comes to thinking through a diagnosis, not just naming it.
The study concluded that although newer versions are improving, these AI tools are still not reliable enough to be used safely on their own in healthcare.
They continue to fall short when it comes to thinking like a doctor.