Healthcare providers view artificial intelligence (AI) with a mix of excitement and trepidation. This groundbreaking technology offers a tantalizing promise: AI could revolutionize clinical efficiency by automating mundane tasks like note-taking, sending emails, and filling out forms. However, there is also the fear that AI might render many clinicians obsolete. As the capabilities of AI expand, will we find that AI can accurately diagnose patients? Will AI soon be making treatment recommendations?
In a recent study, Roy Perlis, MD, MSc, Director of the Center for Quantitative Health at Mass General, and colleagues explored the utility of a large language model in identifying treatment options for bipolar disorder. The research team focused on the treatment of bipolar depression, which carries a significant risk of morbidity and mortality. While a range of evidence-based treatments for bipolar depression is available, there is considerable variability and disagreement about how to treat it optimally.
This study evaluated the performance of two large language models (LLMs) as decision support tools: one without fine-tuning and the other augmented with evidence-based guidelines for the pharmacologic treatment of bipolar depression. While a large language model like GPT-4 can access and analyze an extensive body of information in a matter of seconds, it has limitations. It may not have access to the very latest medical research or drug approvals. Nor do its recommendations reflect clinical experience or judgment.
It is possible to improve the validity and reliability of the information generated by an LLM by fine-tuning the model or by augmenting it with additional, domain-specific information. For this study, the team augmented the LLM (GPT-4) with evidence-based guidelines for the pharmacologic treatment of bipolar depression.
The unaugmented and augmented models were used to analyze 50 clinical vignettes and to identify the five best next-step treatment options and the five worst or contraindicated next-step treatments. The same vignettes were presented to three bipolar disorder experts and to a panel of community prescribers with experience in treating bipolar disorder.
- The level of agreement between the augmented model and expert opinion was fair (Cohen’s κ, 0.31); the augmented model selected the expert-designated optimal treatments 50.8% of the time, compared to 23.4% for the unaugmented model.
- The augmented model outperformed a sample of community clinicians, who selected the optimal treatment 23.1% of the time on average.
- However, the augmented model still made poor or contraindicated recommendations 12% of the time.
- There were some modest but statistically significant differences in the model’s performance based on patient gender and race, suggesting potential for bias.
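For readers unfamiliar with the agreement statistic cited above, Cohen's κ compares how often two raters agree with how often they would be expected to agree by chance: κ = (observed agreement − chance agreement) / (1 − chance agreement). On the commonly used Landis and Koch scale, values from 0.21 to 0.40 are described as "fair." The sketch below is a minimal illustration only. The function name, the labels, and the ten toy ratings are hypothetical; they are not the study's data, and the study's exact method for computing κ is not described here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two raters match.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_chance == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical toy ratings (NOT data from the study): for ten vignettes,
# did each rater place a given treatment among the optimal choices?
model_choices = ["optimal", "optimal", "other", "other", "optimal",
                 "other", "optimal", "other", "other", "optimal"]
expert_choices = ["optimal", "other", "other", "optimal", "optimal",
                  "other", "other", "other", "optimal", "optimal"]

kappa = cohens_kappa(model_choices, expert_choices)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy example the two raters agree on 6 of 10 items (60%), but because each uses "optimal" half the time, 50% agreement would be expected by chance alone, so κ works out to 0.20. This is why a κ of 0.31 can coexist with a much higher raw match rate.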
Based on the findings of the current study, Perlis and colleagues conclude that large language models augmented with evidence-based guidelines show promise as a clinical decision support tool for making treatment recommendations for patients with bipolar depression. However, they emphasize that randomized trials are needed to determine whether the application of the augmented model can improve clinical outcomes without increasing risk to patients. Furthermore, strategies to prevent clinician overreliance on AI recommendations need to be developed.
Read More
Perlis RH, Goldberg JF, Ostacher MJ, Schneck CD. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacology. 2024 Aug; 49(9):1412-1416.
Roy Perlis, MD, MSc, is the Director of the Center for Quantitative Health at MGH and Associate Chief for Research in the Department of Psychiatry. He is the Ronald I. Dozoretz, MD Endowed Professor of Psychiatry at Harvard Medical School and Associate Editor (Neuroscience) at JAMA's new open-access journal, JAMA Network Open. His research is focused on identifying predictors of treatment response in brain diseases and using these biomarkers to develop novel treatments. He directs two complementary laboratory efforts, one focused on patient-derived cellular models and one applying machine learning to large clinical databases.