Google introduces a new AI medical professional that shows greater diagnostic accuracy than a traditional doctor
In a groundbreaking development, recent advancements in large language models (LLMs) have demonstrated significant potential in assisting physicians with the generation of differential diagnoses. These models, such as GPT-4 and DeepSeek-R1, have shown performance that rivals or complements human doctors in some contexts.
One study involving board-certified physicians found that LLM assistance improved diagnostic accuracy when generating differential diagnosis lists. With the assistance of DeepSeek-R1, for instance, doctors produced lists containing the correct diagnosis 52% of the time, compared with only 36% when using conventional search tools. LLMs still have notable weaknesses, however, particularly in handling rare diseases and in logically combining findings.
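The headline figures above (52% versus 36%) are inclusion rates: the share of cases for which the correct diagnosis appears somewhere in the generated differential list. Below is a minimal Python sketch of that metric, using toy data rather than the study's cases.

```python
# Minimal sketch (toy data, not from the study): computing a "top-k inclusion rate"
# for differential-diagnosis lists.

def inclusion_rate(ranked_lists, correct, k=10):
    """Fraction of cases whose top-k differential contains the correct diagnosis."""
    hits = 0
    for ranked, truth in zip(ranked_lists, correct):
        if truth.lower() in (dx.lower() for dx in ranked[:k]):
            hits += 1
    return hits / len(correct)

# Toy example: 2 of 3 lists include the true diagnosis -> ~0.67
lists = [
    ["pneumonia", "pulmonary embolism"],
    ["migraine", "tension headache"],
    ["appendicitis", "ovarian torsion"],
]
truths = ["pneumonia", "cluster headache", "appendicitis"]
print(inclusion_rate(lists, truths, k=5))
```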
New frameworks like MAI-DxO have been developed to enhance LLM performance by incorporating structured, stepwise diagnostic reasoning. This emulates how physicians iteratively refine hypotheses through questioning and testing, improving diagnostic accuracy and efficiency across a range of LLMs.
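MAI-DxO itself is not open source, but the stepwise idea can be illustrated with a simple loop: propose hypotheses, request the single most discriminating question or test, fold the answer back into the findings, and repeat. The sketch below is not the actual framework; it assumes a hypothetical `ask_llm` helper wrapping whatever chat-completion API is available.

```python
# Hedged sketch of stepwise diagnostic reasoning in the spirit of MAI-DxO.
# `ask_llm` is a hypothetical placeholder for any chat-completion client.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def stepwise_diagnosis(case_summary: str, max_rounds: int = 5) -> str:
    findings = case_summary
    hypotheses = ""
    for _ in range(max_rounds):
        # 1. Propose a ranked differential from the findings gathered so far.
        hypotheses = ask_llm(
            f"Findings so far:\n{findings}\n"
            "List the top 3 differential diagnoses with a brief rationale for each."
        )
        # 2. Ask for the single most discriminating question or test.
        next_step = ask_llm(
            f"Hypotheses:\n{hypotheses}\n"
            "Name ONE question or test that best discriminates between these, "
            "or reply DONE if one diagnosis is already clearly supported."
        )
        if next_step.strip().upper().startswith("DONE"):
            break
        # 3. In practice the answer would come from the clinician or the case record.
        findings += f"\nRequested next: {next_step}"
    return hypotheses
```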
State-of-the-art approaches also combine multiple specialized models to jointly analyse clinical text, radiology images, and patient histories, achieving high accuracy. Modelling longitudinal electronic health record data allows earlier and more reliable diagnoses, reflecting real-world clinical practice.
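How such a combination works varies by system, but a common pattern is late fusion: each modality-specific model scores candidate diagnoses, and the scores are merged with weights into a single ranking. The sketch below uses invented model outputs and weights purely for illustration.

```python
# Illustrative late-fusion sketch (scores and weights are invented):
# each modality-specific model scores candidate diagnoses, and the
# weighted sum produces a single ranked differential.
from collections import defaultdict

def fuse(scores_by_modality, weights):
    combined = defaultdict(float)
    for modality, scores in scores_by_modality.items():
        for diagnosis, score in scores.items():
            combined[diagnosis] += weights.get(modality, 0.0) * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

ranked = fuse(
    {
        "clinical_text": {"pneumonia": 0.7, "pulmonary embolism": 0.2},
        "radiology": {"pneumonia": 0.6, "pulmonary embolism": 0.4},
        "ehr_history": {"pneumonia": 0.5, "pulmonary embolism": 0.1},
    },
    weights={"clinical_text": 0.4, "radiology": 0.4, "ehr_history": 0.2},
)
print(ranked)  # pneumonia ranks first under these weights
```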
At the same time, efforts are under way to mitigate issues such as hallucinations, catastrophic forgetting, and the handling of irregular or missing clinical data, in order to improve the reliability and robustness of LLMs in diagnostic tasks.
LLMs can act as helpful assistants that allow doctors to quickly generate thorough differential diagnoses, potentially reducing diagnostic errors and facilitating more informed decision-making. They can also serve as educational tools for medical students and residents, simulating clinical reasoning processes and exposing trainees to a broad range of diagnostic possibilities.
Structured prompting and reasoning frameworks improve cost-effectiveness by guiding model outputs towards relevant diagnoses and appropriate testing, which may streamline clinical workflows.
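As a concrete, entirely illustrative example of structured prompting, constraining the response format keeps the model focused on a ranked differential plus one justified next test, which is what makes the output quick to review. The wording below is an assumption, not taken from any published framework.

```python
# Illustrative structured prompt; the template wording is an assumption,
# not drawn from any published prompting framework.
STRUCTURED_PROMPT = """You are assisting with a differential diagnosis.
Patient summary:
{summary}

Respond in exactly this format:
1. <most likely diagnosis> - <one-line rationale>
2. <second diagnosis> - <one-line rationale>
3. <third diagnosis> - <one-line rationale>
Next test: <single highest-yield test and why it discriminates>"""

prompt = STRUCTURED_PROMPT.format(
    summary="58-year-old with pleuritic chest pain, tachycardia, and a recent long-haul flight."
)
print(prompt)
```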
While LLMs can achieve diagnostic accuracy comparable to human physicians in complex cases, their performance varies by model and by task. For example, GPT-4 includes the correct diagnosis in its differentials at a higher rate than DeepSeek-R1, indicating meaningful differences in output quality between models.
LLMs excel in breadth and rapid generation of differentials but may require human oversight to prioritise and contextualise findings effectively. Frameworks like MAI-DxO impose discipline and systematic reasoning on LLM outputs, partially bridging gaps in clinical reasoning skills compared to physicians and preventing common diagnostic errors.
Despite improvements, LLMs still face challenges such as hallucinations and integration of multimodal data. As a result, they are best viewed as augmentative tools rather than replacements for clinicians.
It is essential to address issues of system safety, fairness, and transparency before LLMs can be responsibly deployed in medicine. In the study's evaluation, the LLM's differential diagnosis lists were rated as significantly more appropriate and comprehensive than those created by unassisted physicians across all 302 cases.
This work provides an evaluation framework for the rigorous testing of AI systems' reasoning abilities on standardized medical cases, paving the way for further advancements in the field.
- Frameworks such as MAI-DxO can improve the performance of large language models in medical diagnosis by emulating physicians' stepwise diagnostic reasoning, thereby enhancing diagnostic accuracy.
- State-of-the-art large language models such as GPT-4 can assist medicine by generating thorough differential diagnoses, potentially reducing diagnostic errors, and can also serve as educational tools for medical students, facilitating more informed decision-making.