Babylon’s ‘chatbot’ claims were no more than clever PR
Claims about performance of its ‘chatbot’ were worryingly conflated with peer-reviewed academic evidence, warns Dr Sam Finnikin
The headlines generated last week when Babylon Health announced that its ‘chatbot’ performed as well as GPs in the MRCGP exam have prompted much debate. Whilst people argue about where artificial intelligence may fit into healthcare, to me, this saga highlights the tactics of public relations machines and the inadequacies of the popular press.
Let’s clear one thing up. This was a cleverly orchestrated public relations campaign, not a piece of academic research. However, Babylon refers to the work as a ‘research paper’ on its Twitter feed, and the paper it refers to is clearly formatted to make it look as if it has been published in an academic journal. The only discernible difference between this ‘paper’ and something that has been through an academic publishing team is where the journal’s name would usually appear. Babylon has simply written ‘Babylon Health’ in this header. Not forgetting, of course, that there are no conflict of interest statements of the kind expected on an academic paper, presumably because the majority of the authors are Babylon employees.
To add a further sheen of respectability, the paper was ‘launched’ at an event held at the Royal College of Physicians. This suggests to the outside world that there was a level of endorsement by the RCP.
The content of the paper itself has been scrutinised and criticised by those more expert than me. The ‘research’ is presented as a piece of academic work, but has not been published in a peer-reviewed journal. Thus, Babylon avoids the inconvenience of peer review, which would highlight the shortcomings of the ‘research’ and might even preclude publication.
We need proper academic research to provide evidence on how best to incorporate technology into our lives
Of course, none of this would have mattered if the press had done its job properly. Journalists should be critical of the information presented in press releases and not be bamboozled by the web of respectability woven by PR companies. But the headlines last week demonstrated that this is not the case. The information was treated in the same way as an academic paper would have been. That is, with very little regard for the reliability of the science, but with an obsession for attention-grabbing headlines.
Artificial intelligence is going to be a part of medicine in the future, and will hopefully augment clinical decision making in a positive way. However, to get to this future, we need proper academic research to provide evidence on how best to incorporate technology into our lives. It is not acceptable to allow technologies to invade healthcare without thorough evaluation, and it is even worse to use sick people as an unwitting proving ground. If Babylon wants to make claims about its chatbot, then it should be obliged to back them up with independent and robust evidence.
Dr Sam Finnikin is a salaried GP in Sutton Coldfield and clinical research fellow at the University of Birmingham
A Babylon spokesperson said: ‘Babylon’s recently released paper, entitled A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis (available here on arXiv), has been co-authored by academics from Stanford University (Prof. Megan Mahoney, Chief of General Primary Care, Division of Primary Care and Population Health), and Yale New Haven Health (Arnold DoRosario, Chief Population Health Officer), alongside research scientists and doctors at Babylon.
‘Giving colleagues across multiple disciplines early visibility vis-à-vis our ongoing research is standard practice in the academic community and has been offered in order to promote discussion of our preliminary findings as well as give transparency regarding our methodology. We will publish a more extensive, peer-reviewed study later in the year.
‘Both Babylon and our research collaborators stand by the results outlined in the paper (whilst recognising some limitations of the study, as described in the paper itself). That our AI has, in these tests, achieved equivalent accuracy with human doctors and demonstrates safety levels of 97% should be acknowledged for what it is – a significant advancement in medicine. Our AI’s capabilities not only show that it is possible for anyone - irrespective of their geography, wealth or circumstances - to have access to health advice that is on-par with top-rated practising clinicians but are an early indicator as to how AI-augmented health services can potentially reduce the burden on doctors and healthcare systems around the world. We believe that our results take humanity a significant step closer to achieving a world where no-one is denied safe and accurate health advice. And that has got to be a better place – for everyone.’
1. Babylon’s AI, in a series of tests (including the relevant sections of the MRCGP exam), demonstrated its ability to provide results which are on par with practising clinicians. The tests carried out relate to the diagnostic exams taken by trainee doctors as a benchmark for accuracy, as set by the Royal College of General Practitioners (RCGP). The average pass mark over the past five years for real-life doctors was 72%. Babylon’s AI scored 81%. As the AI continues to learn and accumulate knowledge, Babylon expects that subsequent testing will produce significant improvements in terms of results. Babylon’s AI, however, for regulatory reasons, remains an information service, rather than a medical diagnosis service.
2. Babylon’s team of scientists, clinicians and engineers recently collaborated with the Royal College of Physicians, Dr Megan Mahoney (Chief of General Primary Care, Division of Primary Care and Population Health, Stanford University), and Dr Arnold DoRosario (Chief Population Health Officer, Yale New Haven Health) to test Babylon’s AI alongside seven highly-experienced primary care doctors using 100 independently-devised symptom sets (or ‘vignettes’). In these tests, Babylon’s AI scored 80% for accuracy, while the seven doctors achieved an accuracy range of 64-94%. In these tests, accuracy of the AI was 98% when assessed against conditions seen most frequently in primary care medicine. In comparison, when Babylon’s research team assessed experienced clinicians using the same measure, their accuracy ranged from 52-99%. Crucially, the safety of the AI was 97%. This compares favourably to the doctors, whose average was 93.1%. Babylon’s research paper, entitled A comparative study of artificial intelligence vs human doctors for the purpose of triage and diagnosis, can be downloaded from arXiv.