ChatGPT failed to spot over 50% of medical emergencies, study finds

ChatGPT regularly fails to recognise that someone needs urgent medical care when users ask it for health-related advice, researchers have reported.

A study published in Nature Medicine found the AI tool also frequently fails to detect suicidal ideation, and it commonly tells people to seek medical advice when it is not needed.

The findings come after chief medical officer Professor Sir Chris Whitty warned that GPs are increasingly having to start their consultations by ‘undoing’ incorrect information that AI has provided to patients.

The study is the first independent safety evaluation of ChatGPT Health – which was launched in January this year as OpenAI’s consumer health tool and has already reached millions of users – with researchers testing its response to 60 patient scenarios covering everything from mild illness to life-threatening emergencies.

For each scenario they also tried out variations, such as changing the patient’s gender, adding test results to the chat or adding comments from family members, to generate 1,000 responses from ChatGPT overall.

The AI responses were compared with assessments provided by doctors, and the researchers found ChatGPT ‘under-triaged’ more than half of the emergency cases it was presented with.

For some textbook emergencies, including stroke or severe allergic reactions, the AI platform worked well, they noted.

But for some other situations that would require a patient to attend hospital immediately, it suggested staying home or booking a routine appointment.

This included one asthma scenario, where ChatGPT advised waiting rather than seeking emergency treatment despite also identifying early warning signs of respiratory failure, they said.

Overall, the researchers found a U-shaped pattern where ‘the most dangerous failures were concentrated at clinical extremes’.

Among gold-standard emergencies, the system under-triaged 52% of cases, directing patients with diabetic ketoacidosis or impending respiratory failure to be seen in 24-48 hours rather than immediately in the emergency department.

It also correctly identified non-urgent cases only 35% of the time, otherwise telling people to seek medical advice when they did not need it.

The US researchers said the findings raise safety concerns that warrant validation before ‘consumer-scale deployment of artificial intelligence triage systems’.

A spokesperson for OpenAI said that while the company welcomed independent research evaluating AI systems in healthcare, the study did not reflect how people typically use ChatGPT Health in real life. They added that the model is continuously updated and refined.

Professor Azeem Majeed, a GP and professor of primary care and public health at Imperial College London, said: ‘AI-based triage tools, as evaluated in the Nature Medicine study, are not yet sufficiently reliable to guide decisions about when to seek urgent medical care.

‘The findings highlight a risk that such tools may create a false sense of reassurance, particularly in serious or emergency situations.’

He said that for patients, this means AI triage should be viewed only as a source of general information and not as a substitute for established emergency care pathways, professional clinical assessment, or the safety-netting advice provided by their GP on when to seek further help.

He added: ‘While these technologies will improve over time, it is important that patients remain aware of their current limitations and seek timely access to appropriate medical care when needed.’

The Government plans to introduce an AI-enabled ‘GP in your pocket’ feature of the NHS app to handle non-urgent care enquiries by 2028.

However, a recent Health Foundation analysis suggested the feature could face resistance from patients.

READERS' COMMENTS [2]

Nick Mann 23 March, 2026 5:26 pm

Whilst AI decision support tools are already in use by doctors on the wards in the USA – apparently well received – independent evaluations are lacking. The potential for automation bias remains large. However, these tools are at least trained on academic research, analysis and guidance.

ChatGPT Health and the proposed ‘GP doctor in your pocket’ are designed for use by patients and this represents a completely different level of AI training and attendant risk. Patients frequently input answers which are at variance with the point of the triaging questions. Even with the correct inputs, the above research highlights how very unsafe this tool appears to be. The AI will ‘learn’ from previous triaging episodes, but there is no feedback loop, so the AI won’t learn when it’s got it wrong before.
It is absolutely astonishing that there is no systematic, independent evaluation of AI products before they are due to hit patients, and this poses unacceptable risks. Regulatory safeguards are being routinely dismissed in favour of convenience, lobbying/hype, and political ambition.
When patients started dying in large numbers from the inexorably increasing 12hr A&E waits (ToA), the government had the knowledge and the agency to evaluate and act from around 2015 onwards.
Jeremy Hunt and Simon Stevens instead chose to cover up that available data in order to push their now familiar narratives of ‘transformation’. Reports and remedies were available, but they chose to sacrifice tens of thousands of patients needlessly dying, regardless. Meanwhile purporting to champion patient safety.
I feel very strongly that the most senior levels of NHS ‘leadership’ have engrained groupthink and have perpetuated a deeply sick culture, where the messaging has overtaken facts and people are rendered conveniently dispensable.
AI has promise in many areas, but also has the potential for huge harm if not adequately evaluated, controlled, and regulated. Profit-hungry US industry is dominating these developments whilst government stands in awe and remains credulous to the point of danger. Wake up. Stand up.

So the bird flew away 23 March, 2026 8:17 pm

“A spokesperson for OpenAI said while the company welcomed independent research” – are they joking? As long as ChatGPT Health’s dataset is private and not publicly released, how can generative AI researchers conduct an independent evaluation? OpenAI have, however, released a “benchmark dataset” but this may obviously suffer from incompleteness, bias, and testing to their closed rubrics… Hallucinations, misinformation and major errors are bound to occur.
Agree with Nick that the usual caveats re regulation, guardrails, and public ownership apply.