Harvard study finds AI outperformed two doctors in emergency room diagnoses

AI vs. ER doctors
A new Harvard-led study suggests large language models may be better than human physicians at diagnosing some emergency room patients, at least in controlled tests using real clinical cases.
Published this week in Science, the research examined how OpenAI's o1 and GPT-4o models performed across several medical settings, including one experiment based on 76 patients who came through the Beth Israel emergency room. The AI systems' diagnoses were compared with those of two internal medicine attending physicians, and the results were scored by two other attending physicians who were blinded to whether each diagnosis came from a doctor or a model.
According to the study, o1 “either performed nominally better than or on par with the two attending physicians and 4o” at each diagnostic stage. The difference was most noticeable at the first step, when ER staff had the least information and the greatest urgency to make the right call.
Strongest edge at triage
Harvard Medical School said the researchers did not pre-process the data before testing the models. Instead, the AI systems received the same information available in the electronic medical record at the time each diagnosis was made.
Using that information, o1 produced the exact diagnosis, or one very close to it, in 67% of triage cases. One physician reached that mark 55% of the time; the other, 50%.
“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said Arjun Manrai, who leads an AI lab at Harvard Medical School and is one of the study’s lead authors, in the school’s press release.
The findings add to growing evidence that large language models can be strong diagnostic tools in certain settings, though the study stops short of saying they are ready to replace doctors in real-world emergencies.