AI models outperform GPs in MRCGP-style exams

Artificial intelligence (AI) tools have outscored GPs in clinical exams modelled on the MRCGP, a result researchers argue strengthens the case for their role in supporting general practice.
Researchers from University College London, King’s College London and Stanford University tested four large language models (LLMs) on 100 multiple-choice questions designed to mirror the Applied Knowledge Test (AKT) component of the MRCGP.
While GPs scored an average of 73%, the AI models achieved results ranging from 95% to 99%. The strongest performance came from OpenAI’s ‘o3’, which scored 99%. Meanwhile, Anthropic’s Claude Opus 4, xAI’s Grok-3 and Google’s Gemini 2.5 Pro each scored 95%.
The study – published as a preprint on arXiv and not yet peer reviewed – found that LLMs can perform at a very high level on clinical knowledge tests, but cautioned that test performance is not the same as safe practice.
The paper said: ‘All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3.
‘These findings strengthen the case for LLMs – particularly reasoning models – to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.’
The researchers cautioned that while LLMs can serve as powerful educational and decision support tools, they are not a substitute for clinicians.
‘LLMs should be viewed as powerful adjuncts to, rather than replacements for, clinicians. While they can provide valuable support in education and decision-making, they lack the contextual understanding, empathy and accountability that are central to safe and effective medical practice,’ the paper said.
RCGP chair Professor Kamila Hawthorne said: ‘Practising as a GP is far more than having good clinical knowledge – although that is, of course, important – it is about having good communication and consultation skills, being able to consider multiple factors that may be impacting on a patient’s health in order to make a diagnosis in partnership with them, and balancing risk.
‘This is why the MRCGP exam is a tripartite assessment consisting of the Simulated Consultation Assessment and continuous workplace-based assessment throughout GP training, as well as the Applied Knowledge Test, which this research attempts to mimic.’
Professor Hawthorne also noted that the researchers in the study ‘did not have access to the RCGP’s AKT question bank – and that any GP registrars sitting the AKT would not be able to use AI given it is conducted under strict exam conditions’.
The Government’s 10-year plan for health placed significant emphasis on the potential of AI, highlighting its role in diagnostics, triage and reducing administrative burden.
Pulse has previously reported on the growing use of ambient scribe technology in GP practices for taking patient notes. Although the tools may save consultation time, concerns have been raised over medico-legal responsibility.
A recent study also suggested ChatGPT was helpful for summarising patient notes.
Professor Hawthorne said: ‘AI does have huge potential to support primary care education, and support GPs in the delivery of patient care – and we would welcome more research into this area. But the scope of this study does not fully account for the nuances in GP training or the breadth of professional skills that the MRCGP assesses.’
The UK medicines regulator has urged GP practices to report all adverse incidents and inaccuracies with AI tools used in clinical practice to its Yellow Card Scheme.
Readers' comments
If there was an exam for politicians aspiring to become an MP (call it “the PRICs” – Politician’s Royal Institute College exams), then I’m 100% sure that 100% of the time AI would beat 100% of real MPs. Can we replace MPs with AI please?
#statistics.mismeasure.validity