Documenting patient care, including compiling diagnostic reports, writing progress notes, recording medication administration, and synthesizing a patient’s treatment history across different specialists, is a crucial part of the health care system.

This record keeping takes up a great deal of health care professionals’ time, a burden that has only increased with electronic health records. Doctors are estimated to spend two hours documenting patient care for every hour of patient interaction, and 60% of nurses’ time is devoted to record keeping. And for all the effort put into electronic templates, entries in patient health records can be inconsistent between individual health professionals, often shorthand or opaque in detail and, more concerningly, sometimes incomplete or inaccurate.

While there have been many studies of fine-tuning AI for the medical domain, a recent Stanford University study (Van Veen, D., Van Uden, C., Blankemeier, L. et al.) assesses for the first time the use of AI as a real-time medical record taker across a diverse range of clinical tasks, comparing its output directly with that of human clinicians. The study also includes a safety analysis of the risks of AI hallucinating information about patients.

Methodology

The Stanford study used eight different AI models, some open source (including Llama, Alpaca and FLAN-UL2) and some closed source (GPT-3.5 and GPT-4). Some models had already been specifically trained on medical data, such as Med-Alpaca, while others, such as GPT-4, have demonstrated strong performance on biomedical NLP tasks such as medical exams. The Stanford team further adapted the models using two lightweight training methods (i.e. training that would be well within the reach of health institutions): In-context Learning, which does not alter the model (it does not touch the weights or parameters) but provides the AI with a handful of worked examples in the prompt to show it how to handle the task; and Quantized Low-Rank Adaptation (QLoRA), which keeps the quantized base model frozen and trains small low-rank matrices added to a subset of its layers, tailoring the model to the specific task for which it is to be used.
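
To make the two adaptation routes concrete, the sketch below shows, in Python, a few-shot prompt being assembled for In-context Learning and a QLoRA adapter being attached to an open-source model. It is a minimal illustration only: the model name, prompt wording and hyperparameters are placeholders rather than the study’s configuration, and it assumes the Hugging Face transformers and peft libraries.

```python
# Illustrative sketch only: the model name, prompt wording and hyperparameters
# are placeholders, not the configuration used in the Stanford study.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# --- In-context Learning (ICL): the model weights are never touched. ---
# A handful of worked examples is simply prepended to the new input.
def build_icl_prompt(examples, new_findings):
    """examples: list of (radiology_findings, expert_impression) pairs."""
    prompt = "Summarize the radiology findings into a concise impression.\n\n"
    for findings, impression in examples:
        prompt += f"Findings: {findings}\nImpression: {impression}\n\n"
    prompt += f"Findings: {new_findings}\nImpression:"
    return prompt

# --- QLoRA: the base model is loaded in 4-bit precision and frozen; only
# small low-rank adapter matrices added to its layers are trained. ---
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder open-source model
    quantization_config=bnb_config,
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # placeholder hyperparameters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)  # attaches trainable adapters
model.print_trainable_parameters()  # only a tiny fraction of weights will train
```

In both cases the expensive pretraining is left untouched: ICL changes only the prompt, while QLoRA trains a small add-on that can be stored and swapped per task.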

The study focused (and trained the eight AI models) on four summarization tasks:

  • Radiology reports: the AI models were tasked with concisely capturing the most salient, actionable information from a patient’s radiology report. The models were trained on data from radiology reports spanning seven anatomies (head, abdomen, chest, spine, neck, sinus, and pelvis) and two modalities (magnetic resonance imaging (MRI) and computed tomography (CT)), together with the accompanying free-text radiology notes, from the Beth Israel Deaconess Medical Center between 2011 and 2016.

  • Patient questions: the AI models were tasked with generating a condensed question expressing the minimum information required to find a correct answer to the patient’s original question. The AI models were trained on patient health questions of varying verbosity and coherence selected from messages sent to the U.S. National Library of Medicine.

  • Patient progress: the AI models were tasked with generating a “problem list” for each patient following a clinician’s bedside visit. The AI models were trained on de-identified data from hospital intensive care unit (ICU) admissions.

  • Dialogue: the AI models were tasked with summarising a doctor-patient conversation into an “assessment and plan” paragraph. The AI models were trained on 207 doctor-patient conversations and corresponding patient visit notes.

A key part of the model training was conciseness; otherwise, “the model might generate lengthy outputs - occasionally even longer than the input text”.
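
As a rough illustration of how such task instructions and a conciseness constraint can be expressed, the sketch below builds a prompt for each of the four tasks with an optional word cap. The wording is hypothetical, not the study’s actual prompts.

```python
# Illustrative only: the instruction wording below is a placeholder, not the
# study's actual prompts.
TASK_INSTRUCTIONS = {
    "radiology": "Summarize the findings into a brief impression.",
    "question":  "Condense the patient's message into a single short question.",
    "progress":  "List the patient's active problems, one per line.",
    "dialogue":  "Write a short assessment-and-plan paragraph for this visit.",
}

def build_prompt(task, source_text, max_words=None):
    """Build a task prompt; an explicit word cap is one way to stop the
    model from producing summaries longer than the input."""
    instruction = TASK_INSTRUCTIONS[task]
    if max_words is not None:
        instruction += f" Use no more than {max_words} words."
    return f"{instruction}\n\nInput:\n{source_text}\n\nSummary:"
```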

The study compared notes recorded by a clinician with notes generated separately by AI following the clinician’s visit to a particular patient or review of the patient’s tests: the same medical intervention was therefore recorded once human-only and once AI-only.

A team of specialist clinicians reviewed a random sample of 100 summaries of the radiology reports, patient questions and patient progress notes (the dialogue notes were considered too unwieldy to review), together with the matching notes prepared in the other stream. This means each reviewer saw, on an individually de-identified basis, both the human-only and the AI notes for the same medical intervention. Each pair of human-only and AI-only notes was given a grade for completeness, correctness, and conciseness.
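
The review protocol amounts to a blinded, pairwise comparison, which the sketch below illustrates. The field names and the 1-5 grading scale are placeholders, not the study’s actual review instrument.

```python
import random

# Illustrative only: the field names and 1-5 grading scale are placeholders,
# not the study's actual review instrument.
ATTRIBUTES = ("completeness", "correctness", "conciseness")

def make_blinded_pair(human_summary, ai_summary):
    """Present the two summaries of the same medical intervention in random
    order, so the reviewer cannot tell which one the clinician wrote."""
    pair = [("human", human_summary), ("ai", ai_summary)]
    random.shuffle(pair)
    shown = {"A": pair[0][1], "B": pair[1][1]}  # what the reviewer sees
    key = {"A": pair[0][0], "B": pair[1][0]}    # kept aside for analysis
    return shown, key

def score_difference(grades, key):
    """grades: {'A': {attr: 1-5}, 'B': {attr: 1-5}} from one reviewer.
    Returns the AI-minus-human difference for each attribute."""
    ai, human = ("A", "B") if key["A"] == "ai" else ("B", "A")
    return {attr: grades[ai][attr] - grades[human][attr] for attr in ATTRIBUTES}
```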

Results

Overall, the study found the AI received higher marks for its homework than the clinicians: “...[human-only] summaries are preferred in only a minority of cases (19%), while in a majority, the best [AI] model is either non-inferior (45%) or preferred (36%)...”.

Drilling into the results:

  • Completeness: of the three criteria, AI performed more strongly on completeness than the human-only summaries. This was not a function of the AI writing more than the time-pressed clinicians, because the lengths of the machine-generated and human-only responses were broadly similar: for example, 47 ± 24 (AI) vs. 44 ± 22 (clinicians) tokens for radiology reports.

  • Conciseness: not only were the AI summaries considered by the reviewers to be more complete, the reviewers also marked the AI as being more concise overall and on the individual tasks of patient questions and patient progress. Only on radiology reports did the reviewers consider the clinician-only summaries more concise (though, perhaps oddly, they still marked the clinician-only radiology summaries as being less complete!). However, the researchers thought the AI could do better still if the training were more specific, such as setting a bright-line ceiling on summary length (for example, no more than 15 words).

  • Correctness: the best AI model generated significantly fewer errors than clinician-only summaries overall and on two of the three summarization tasks (radiology reports and patient questions). The Stanford researchers noted:

“As an example of the model’s superior correctness performance on radiology reports, we observe that it avoided common medical expert errors related to lateral distinctions (right vs. left).”

Do doctors or AI hallucinate more?

AI has a well-documented (if not inherent) tendency to hallucinate - but apparently so do clinicians. The Stanford study gave the following example (perhaps with a touch of irony):

during the blinded study, [a reviewer] erroneously assumed that a hallucination - the incorrect inclusion of a urinary tract infection - was made by the model. In this case, the medical expert was responsible for the hallucination. This instance underscores the point that even medical experts, not just LLMs, can hallucinate.

In fact, the Stanford study found that the doctors ‘out-hallucinated’ the AI. As the researchers put it:

Given the [AI] model’s lower error rate in each category, this suggests that incorporating LLMs could actually reduce fabricated information in clinical practice.

While any error in a medical record is to be avoided, some errors can cause greater harm to patients than others. The safety analysis therefore required reviewers to assess the potential for medical harm of each summarization error they detected. The reviewers assessed the identified errors (on a blind basis as between clinician-generated and AI-generated errors) against the Agency for Healthcare Research and Quality (AHRQ) harm scale.

The results were that the errors in the human-only summaries carried both a higher likelihood (14%) and a higher extent (22%) of possible harm than those in the summaries from the best model (12% and 16%, respectively).

Model comparison

As between the AI models, the GPT models tended to perform best overall. Interestingly, the versions of models fine-tuned for medical applications, such as Med-Alpaca, did not perform as well as the general versions of the same model.

Models adapted for the study using In-context Learning (ICL), the lightest form of adaptation, performed better than models adapted using the more ‘intrusive’ Quantized Low-Rank Adaptation (QLoRA).

While GPT (a proprietary model) performed best overall, the open-source models performed very well. Combined with the strong showing of ICL, this suggests relatively low barriers to the adaptation and deployment of AI by individual medical institutions. The Stanford researchers noted that, although open-source AI carries some different governance issues:

ICL [on open-source models] provides many benefits: (1) model weights are fixed, hence enabling queries of pre-existing LLMs (2) adaptation is feasible with even a few examples, while fine-tuning methods such as QLoRA typically require hundreds or thousands of examples.
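
To give a rough sense of how lightweight that deployment can be, the sketch below queries a locally hosted open-source model with nothing more than a few in-context examples in the prompt. It again assumes the Hugging Face transformers library, and the model and the worked examples are placeholders, not drawn from the study.

```python
# Illustrative only: the model and the worked examples are placeholders; an
# institution would substitute its own locally hosted model and de-identified
# examples drawn from its records.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

few_shot_prompt = (
    "Findings: Mild degenerative change of the lumbar spine.\n"
    "Impression: Mild lumbar degenerative change.\n\n"
    "Findings: No acute intracranial abnormality on CT of the head.\n"
    "Impression: Normal head CT.\n\n"
    "Findings: 5 mm right lower lobe nodule, unchanged from prior imaging.\n"
    "Impression:"
)
result = generator(few_shot_prompt, max_new_tokens=30, do_sample=False)
print(result[0]["generated_text"])
```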

Conclusion

The Stanford study concludes:

Evidence from this study suggests that incorporating LLM-generated candidate summaries into the clinical workflow could reduce documentation load, potentially leading to decreased clinician strain and improved patient care.

AI use in generating medical records might even rid the medical system of the notorious problem of physician’s scrawl; interestingly, a US study has found that lower-than-average legibility was associated with being an executive and being male!

Read more here: Adapted large language models can outperform medical experts in clinical text summarization