
Assessing the diagnostic competence of large language models in lung ultrasound through text and image-based evaluation

LLMs in lung ultrasound

Research Article DOI: 10.4328/ACAM.22956

Authors

Affiliations

1Department of Radiology, Faculty of Medicine, Ankara 29 Mayis State Hospital, Ankara, Turkey

2Department of Radiology, Faculty of Medicine, Ankara Mamak State Hospital, Ankara, Turkey

3Department of Radiology, Faculty of Medicine, Ankara Bilkent City Hospital, Ankara, Turkey

Corresponding Author

Abstract

Aim Large language models (LLMs) are increasingly explored in radiology for knowledge retrieval and decision support. Lung ultrasound (LUS) is an artifact-driven, point-of-care modality that demands expert pattern recognition and clinical integration. We compared two state-of-the-art LLMs with radiologists across text-based and image-based LUS tasks.
Materials and Methods In this cross-sectional study, two LLMs (ChatGPT-5 and Gemini 2.5 Pro) and two radiologists—a junior radiologist (JR) and a senior radiologist (SR)—were assessed. First, performance was evaluated with 25 multiple-choice questions (MCQs) covering core LUS domains. Next, 25 LUS images in PNG format were presented, and participants answered four standardized questions per case: (1) normal vs. pathological LUS; (2) pleural effusion present/absent; (3) consolidation present/absent; and (4) B-lines present/absent. Responses were benchmarked against a reference standard. McNemar’s test was used for statistical comparisons.
Results LLMs achieved very high accuracy on MCQs, comparable to radiologists (p > 0.05). In image-based tasks, LLMs performed well in distinguishing normal from pathological LUS and in detecting pleural effusion, while demonstrating moderate performance for consolidation and B-line detection. There was no significant difference between the two LLMs across all image-based tasks (p > 0.05).
Discussion LLMs show strong text-based competence and promising image-based performance for detecting any abnormality and pleural effusion on LUS, but remain moderate for consolidation and B-line recognition. LLMs may function as adjunctive tools for clinicians performing lung ultrasound.

Keywords

lung ultrasound, ChatGPT, performance

Introduction

LLMs have rapidly progressed from natural-language understanding to clinical decision support, stimulating growing interest within radiology workflows [1–4]. Recent comparative studies across imaging domains suggest that LLMs frequently approach expert-level performance, knowledge, and reliability across different radiological subjects and applications [5–8].
LUS provides bedside, radiation-free assessment for dyspnea and acute respiratory failure and is also critical for patients in the intensive care unit [9]. Many factors, such as the inability to position these patients appropriately and patient non-cooperation, make both the performance and evaluation of LUS difficult. Furthermore, like other ultrasound examinations, LUS is user-dependent, which limits its reliability [9–12].
Despite the accelerating literature on LLMs in radiology, targeted evaluations in LUS remain sparse [5–8]. In a recent study, Sun et al. compared the diagnostic accuracy of GPT-4-based and GPT-4V-based ChatGPT models with radiologists in musculoskeletal radiology and found that GPT-4-based ChatGPT achieved significantly higher accuracy than its vision-enabled counterpart, performing comparably to radiology residents but below board-certified radiologists [6]. These findings highlight that while text-based LLMs can approach human-level diagnostic reasoning in musculoskeletal imaging, their visual interpretation capabilities remain limited, emphasizing the need for further refinement of multimodal LLMs before clinical implementation.
This study aimed to evaluate the performance of LLMs in LUS and compare them with radiologists of different experience levels. It incorporated both a text-based knowledge assessment using multiple-choice questions (MCQs) and an image-based evaluation of LUS images. Through this dual approach, the study provides a comprehensive overview of the current capabilities and limitations of multimodal LLMs (ChatGPT-5 and Gemini 2.5 Pro) in distinguishing normal from pathological scans and in detecting pleural effusions, consolidations, and B-lines, thereby offering insights into their potential role as supportive tools in point-of-care imaging.

Materials and Methods

Study Design
This cross-sectional experimental study compared the performance of ChatGPT-5 (https://chat.openai.com) and Gemini 2.5 Pro (https://gemini.google.com) with a junior radiologist (JR) and a senior radiologist (SR). The evaluation comprised a text-based knowledge assessment and an image-based assessment using LUS images. The workflow of the study is provided in Figure 1. The study adhered to recommended principles for diagnostic accuracy reporting (STARD) and used only de-identified cases [13].
Data Collection and Input–Output Process
Twenty-five original MCQs were created by Radiologist 1 (R1), who is board-certified (European Diploma in Radiology) and has 7 years of experience in general radiology. The questions covered probe selection and technique, pleural artifacts (A- and B-lines), pneumothorax signs (lung sliding and M-mode patterns), pleural effusion patterns, consolidation (including dynamic vs. static air bronchograms), and common diagnostic profiles; items emphasized clinically actionable reasoning anchored in consensus documents and foundational LUS literature [9–12,14,15].
Twenty-five LUS images in PNG format were selected from an open-access dataset (https://www.kaggle.com/datasets/orvile/lung-ultrasound-imaging-dataset), representing a balanced distribution of normal studies and key pathologies (effusion, consolidation, interstitial syndrome/B-lines) across typical scanning windows. For every case, both LLMs and the radiologists answered four standardized questions (an illustrative record of these answers is sketched after the list below):
1. Normal or pathological LUS;
2. Pleural effusion present/absent;
3. Consolidation present/absent;
4. B-lines present/absent.
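As an illustrative sketch only, each case's four standardized answers can be captured in a small Python record such as the one below; the class and field names are hypothetical and are not taken from the study's forms.

from dataclasses import dataclass

@dataclass
class CaseResponse:
    case_id: int
    pathological: bool   # Question 1: normal (False) vs. pathological (True)
    effusion: bool       # Question 2: pleural effusion present
    consolidation: bool  # Question 3: consolidation present
    b_lines: bool        # Question 4: B-lines present

# Example: a case read as pathological with B-lines but no effusion or consolidation
example = CaseResponse(case_id=1, pathological=True, effusion=False,
                       consolidation=False, b_lines=True)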
Prompts for LLMs
MCQs
“Act like a professor of radiology with 30 years of experience in lung ultrasound. I will ask you multiple-choice questions, each with only one correct answer. Provide only the letter of the most accurate choice, with no explanation.”
Image-based Questions (IQs)
“Act like a professor of radiology with 30 years of experience in lung ultrasound. I will provide you with LUS images. Answer:
(1) normal vs. pathological; (2) pleural effusion present/absent; (3) consolidation present/absent; (4) B-lines present/absent. Provide only the final answers without explanations.”
No model fine-tuning, pre-training, or external tools were used; default interfaces and hyperparameters were applied.
Examples of the chat sessions with the models are provided in Figures 2 and 3.
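The study itself used only the default web chat interfaces, as noted above. Purely as a hypothetical illustration of how the same image-based prompt could be submitted programmatically, a minimal sketch using the OpenAI Python SDK might look as follows; the model identifier and file path are assumptions and were not part of the study.

import base64
from openai import OpenAI

IMAGE_PROMPT = (
    "Act like a professor of radiology with 30 years of experience in lung "
    "ultrasound. I will provide you with LUS images. Answer: (1) normal vs. "
    "pathological; (2) pleural effusion present/absent; (3) consolidation "
    "present/absent; (4) B-lines present/absent. Provide only the final "
    "answers without explanations."
)

def ask_about_image(image_path, model="gpt-5"):  # model identifier is an assumption
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": IMAGE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encoded}"}},
            ],
        }],
    )
    return response.choices[0].message.content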
Radiologists and Response Procedure
Junior Radiologist (JR)
European diploma; 7 years’ experience in general radiology
Senior Radiologist (SR)
20 years’ experience in general radiology with 8 years’ experience in thoracic radiology
Radiologists independently completed the MCQs and IQs via electronic forms without internet access.
Reference Standard and Scoring
Reference answers for MCQs and IQs were established a priori by R1 using literature-concordant definitions [9–12,14–18]. R1 evaluated the responses and scored them as correct (1) or incorrect (0).
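Purely as an illustrative sketch (scoring in the study was performed manually by R1), per-question accuracy can be obtained by comparing each participant's answers with the reference standard; the dictionary keys and variable names below are assumptions, not taken from the study's materials.

def score_against_reference(responses, reference):
    """Score each answer as correct (1) or incorrect (0) and return per-question accuracy."""
    questions = ("pathological", "effusion", "consolidation", "b_lines")
    totals = {q: 0 for q in questions}
    for case_id, truth in reference.items():
        answers = responses[case_id]
        for q in questions:
            totals[q] += int(answers[q] == truth[q])  # 1 if it matches the reference, else 0
    n_cases = len(reference)
    return {q: totals[q] / n_cases for q in questions}

# Hypothetical example with two cases: consolidation scored 1/2, all other questions 2/2
reference = {1: {"pathological": True, "effusion": True, "consolidation": False, "b_lines": True},
             2: {"pathological": False, "effusion": False, "consolidation": False, "b_lines": False}}
responses = {1: {"pathological": True, "effusion": True, "consolidation": True, "b_lines": True},
             2: {"pathological": False, "effusion": False, "consolidation": False, "b_lines": False}}
print(score_against_reference(responses, reference))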
Statistical Analysis
Distributions were examined with the Kolmogorov–Smirnov test. Because the data were non-normal, non-parametric statistics were used. Paired comparisons employed McNemar’s test. A two-sided p≤0.05 was considered statistically significant. Statistical analyses were performed using IBM SPSS Statistics, Version 28.0 (IBM Corp., Armonk, NY, USA).
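For illustration only (the analyses in the study were run in SPSS), the paired comparison described above can be reproduced with the statsmodels implementation of McNemar's test applied to per-case correctness of two readers on the same cases; the correctness vectors below are hypothetical.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_readers(correct_a, correct_b):
    """McNemar's test on paired correctness (1 = correct, 0 = incorrect) for the same cases."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    # 2x2 table of concordant and discordant answers across the paired cases
    table = [[int(np.sum(a & b)),  int(np.sum(a & ~b))],
             [int(np.sum(~a & b)), int(np.sum(~a & ~b))]]
    return mcnemar(table, exact=True)  # exact binomial version suits small discordant counts

# Hypothetical per-case correctness for 25 images read by an LLM and a radiologist
llm_correct    = [1] * 20 + [0] * 5
reader_correct = [1] * 24 + [0] * 1
print(compare_readers(llm_correct, reader_correct).pvalue)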
Ethical Approval
All questions used in this study were created by the authors. Since no patient identifiers or authentic patient data were used, and all cases were obtained from a publicly available dataset, ethical approval was not required for this study.

Results

Multiple-Choice Questions
ChatGPT-5 achieved an accuracy of 92.0% (23/25), and Gemini 2.5 Pro scored 88.0% (22/25). SR achieved 88.0% (22/25), while JR reached 84.0% (21/25). There was no statistically significant difference between the LLMs or between the LLMs and the radiologists (p > 0.05). These results indicate that both LLMs performed at a level comparable to that of experienced radiologists in text-based knowledge (Table 1).
Image-Based Questions
ChatGPT-5 correctly classified 22 of 25 images (88.0%), Gemini 2.5 Pro 20 of 25 (80.0%), JR 23 of 25 (92.0%), and SR 24 of 25 (96.0%). Both LLMs performed comparably to JR (p > 0.05) but significantly worse than SR (p < 0.05).
ChatGPT-5 identified pleural effusion presence/absence correctly in 21 of 25 images (84.0%), Gemini 2.5 Pro in 20 of 25 (80.0%), JR in 23 of 25 (92.0%), and SR in 24 of 25 (96.0%). No significant difference was found between the LLMs and JR (p > 0.05), while SR outperformed all other participants (p < 0.05).
ChatGPT-5 correctly identified consolidation presence/absence in 16 of 25 images (64.0%), Gemini 2.5 Pro in 15 of 25 (60.0%), JR in 20 of 25 (80.0%), and SR in 23 of 25 (92.0%). Both LLMs performed significantly lower than SR (p < 0.01) but comparably to JR (p > 0.05).
ChatGPT-5 detected B-line presence/absence correctly in 17 of 25 cases (68.0%), Gemini 2.5 Pro in 16 of 25 (64.0%), JR in 20 of 25 (80.0%), and SR in 22 of 25 (88.0%). Differences between the LLMs were not statistically significant (p > 0.05), though SR again achieved the highest performance (p < 0.01).

Discussion

One of the most striking findings of our study is that LLMs demonstrate strong text-based proficiency in LUS. High text-based performance reflects LLMs’ strength in codified knowledge. The very high MCQ scores align with prior studies showing that LLMs excel at guideline-anchored or rule-based tasks in radiology, including report structuring, translation, and standardized knowledge queries [1–8]. LUS fundamentals—definitions of sliding, seashore/barcode patterns, artifact profiles, and effusion/consolidation hallmarks—are well described and relatively stable across consensus documents and core texts, which likely contributed to LLM success through the models’ training data [9–12,14,15].
The models’ ability to distinguish normal from pathological LUS and to detect pleural effusion likely results from high-contrast image features that facilitate analysis by LLMs: effusion forms an anechoic space with reproducible ancillary signs (spine sign, “jellyfish”), and departure from a normal A-line pattern is visually salient [9–12,16–18]. Our study is unique in that it evaluates the image-based performance of multimodal LLMs.
Moderate performance for consolidation and B-lines underscores a key limitation of the clinical utility of current multimodal LLMs. Detecting consolidation and B-lines demands a nuanced interpretation of artifact behavior (dynamic vs. static air bronchograms, pleural line irregularity, coalescent vertical artifacts that erase A-lines and move with sliding) and is sensitive to acquisition variability (probe frequency, depth, gain, insonation angle) and patient factors (obesity, subcutaneous emphysema) [9–12,14,15,19,20]. Previous studies in other domains report that LLMs often match radiologists on text-based tasks but are deficient in image-based evaluation, supporting our finding that general-purpose multimodal stacks remain less reliable for distinct pathology characterization without domain-specific tuning [5–8,19–21].
The absence of a significant difference between the LLMs suggests convergent capabilities. The lack of a performance gap between ChatGPT-5 and Gemini 2.5 Pro across endpoints may reflect convergent training strategies and similar classes of vision encoders. These observations emphasize that while multimodal LLMs can “see,” they are not yet optimized for artifact-heavy modalities like ultrasound; targeted fine-tuning on curated LUS corpora and protocolized capture may narrow this gap [3,4,20,21].
Within established frameworks, triage often hinges on global abnormality and effusion detection—areas where our LLMs performed well [9–12]. However, specific characterization, diagnosis, and management decisions (e.g., distinguishing pneumonia from atelectasis, characterizing interstitial patterns) require reliable consolidation and B-line assessment; until multimodal models improve, expert oversight remains essential. Standardized acquisition and interpretation checklists, combined with LLM-assisted education and quality control, could enhance real-world utility [9–12,17–22].

Limitations

First, the dataset—while curated to span typical and key pathologies—may not capture the full heterogeneity of LUS across clinical practice. Second, LLMs evolve rapidly; these findings reflect models at a specific point in time and may not generalize to future versions. Finally, patients’ clinical vignettes and chest X-rays are also crucial for interpreting LUS images; however, since our dataset did not include these, no fictional clinical vignettes were created, to avoid bias. The lack of clinical vignettes and chest X-rays may have reduced the performance of the LLMs. Further studies with new datasets that include clinical vignettes are required to test this hypothesis.

Conclusion

In conclusion, LLMs show great potential as adjunctive tools to support education, triage, and assessment in lung ultrasound, but expert oversight remains indispensable for clinical utility.

References

  1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930-40. doi:10.1038/s41591-023-02448-8.
  2. Kim S, Lee CK, Kim SS. Large language models: a guide for radiologists. Korean J Radiol. 2024;25(2):126-33. doi:10.3348/kjr.2023.0997.
  3. Sorin V, Barash Y, Konen E, Klang E. Large language models for oncological applications. J Cancer Res Clin Oncol. 2023;149(11):9505-8. doi:10.1007/s00432-023-04824-w.
  4. Meddeb A, Lüken S, Busch F, et al. Large language model ability to translate CT and MRI free-text radiology reports into multiple languages. Radiology. 2024;313(3):e230922. doi:10.1148/radiol.241736.
  5. Horiuchi D, Tatekawa H, Oura T, et al. ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology. Eur Radiol. 2024;34(1):506-16. doi:10.1007/s00330-024-10902-5.
  6. Sun SH, Chen K, Anavim S, et al. Large language models with vision on diagnostic radiology board exam style questions. Acad Radiol. 2025;32(5):3096-102. doi:10.1016/j.acra.2024.11.028.
  7. Zaki HA, Aoun A, Munshi S, Abdel-Megid H, Nazario-Johnson L, Ahn SH. The application of large language models for radiologic decision making. J Am Coll Radiol. 2024;21(7):1072-8. doi:10.1016/j.jacr.2024.01.007.
  8. Takita H, Walston SL, Mitsuyama Y, Watanabe K, Ishimaru S, Ueda D. Comparative performance of large language models in structuring head CT radiology reports: multi-institutional validation study in Japan. Jpn J Radiol. 2025;43(9):1445-55. doi:10.1007/s11604-025-01799-1.
  9. Lichtenstein DA, Mézière GA. Relevance of lung ultrasound in the diagnosis of acute respiratory failure: the BLUE protocol. Chest. 2008;134(1):117-25. doi:10.1378/chest.07-2800.
  10. Volpicelli G, Elbarbary M, Blaivas M, et al. International evidence-based recommendations for point-of-care lung ultrasound. Intensive Care Med. 2012;38(4):577-91. doi:10.1007/s00134-012-2513-4.
  11. Buda N, Mendrala K, Skoczyński S, et al. Basics of point-of-care lung ultrasonography. N Engl J Med. 2023;389(21):e44. doi:10.1056/NEJMvcm2108203.
  12. Soldati G, Smargiassi A, Inchingolo R, et al. Proposal for international standardization of the use of lung ultrasound for patients with COVID-19: a simple, quantitative, reproducible method. J Ultrasound Med. 2020;39(7):1413-9. doi:10.1002/jum.15285.
  13. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Radiology. 2015;277(3):826-32. doi:10.1148/radiol.2015151516.
  14. Jenssen C, Tuma J, Möller K, et al. Artefakte in der Sonografie und ihre Bedeutung für die internistische und gastroenterologische Diagnostik - Teil 2: Artefakte im Farb- und Spektraldoppler [Ultrasound artifacts and their diagnostic significance in internal medicine and gastroenterology - part 2: color and spectral Doppler artifacts]. Z Gastroenterol. 2016;54(6):569-78. doi:10.1055/s-0042-103248.
  15. Reissig A, Copetti R, Mathis G, et al. Lung ultrasound in the diagnosis and follow-up of community-acquired pneumonia. Respiration. 2014;87(3):179-89. doi:10.1159/000357449.
  16. Al Deeb M, Barbic S, Featherstone R, Dankoff J. Point-of-care ultrasonography for the diagnosis of acute cardiogenic pulmonary edema: a systematic review and meta-analysis. Acad Emerg Med. 2014;21(8):843-52. doi:10.1111/acem.12435.
  17. Staub LJ, Biscaro RRM, Maurici R. Accuracy and applications of lung ultrasound to diagnose ventilator-associated pneumonia: a systematic review. J Intensive Care Med. 2018;33(8):447-55. doi:10.1177/0885066617737756.
  18. Balik M, Plasil P, Waldauf P, et al. Ultrasound estimation of volume of pleural fluid in ventilated patients. Intensive Care Med. 2006;32(2):318-21. doi:10.1007/s00134-005-0024-2.
  19. Li CP, Jakob J, Menge F, Reißfelder C, Hohenberger P, Yang C. Comparing ChatGPT-3.5 and ChatGPT-4’s alignment with the German evidence-based S3 guideline for adult soft-tissue sarcoma. iScience. 2024;27(12):111493. doi:10.1016/j.isci.2024.111493.
  20. Wang Y, Wu X, Carlson L, Oniani D. Generative AI enhanced with NCCN clinical practice guidelines for clinical decision support: a case study on bone cancer. J Clin Oncol. 2024;42(16_suppl): e13623-e13623. doi:10.1200/JCO.2024.42.16_suppl.e13623.
  21. Monroe CL, Abdelhafez YG, Atsina K, Aman E, Nardo L, Madani MH. Evaluation of responses to cardiac imaging questions by the large language model ChatGPT. Clin Imaging. 2024;112:110193. doi:10.1016/j.clinimag.2024.110193.
  22. Eren Ç, Turay C, Celal GY. A comparative study: can large language models beat radiologists on PI-RADSv2.1-related questions? J Med Biol Eng. 2024;44:821-30. doi:10.1007/s40846-024-00914-3.

Declarations

Scientific Responsibility Statement

The authors declare that they are responsible for the article’s scientific content, including study design, data collection, analysis and interpretation, writing, preparation and scientific review of the contents, and approval of the final version of the article.

Animal and Human Rights Statement

All procedures performed in this study were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Funding

None

Conflict of Interest

The authors declare that there is no conflict of interest.

Data Availability

The datasets used and/or analyzed during the current study are not publicly available due to patient privacy reasons but are available from the corresponding author on reasonable request.

Additional Information

Publisher’s Note
Bayrakol MP remains neutral with regard to jurisdictional and institutional claims.

Rights and Permissions

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). To view a copy of the license, visit https://creativecommons.org/licenses/by-nc/4.0/

About This Article

How to Cite This Article

Eren Çamur, Turay Cesur, Murathan Köksal. Assessing the diagnostic competence of large language models in lung ultrasound through text and image-based evaluation. Ann Clin Anal Med 2025; DOI: 10.4328/ACAM.22956

Publication History

Received:
October 22, 2025
Accepted:
November 24, 2025
Published Online:
December 8, 2025