
Evaluation of artificial intelligence chatbots in answering questions about unintentional weight loss: a comparative study

AI chatbots on unintentional weight loss

Research Article DOI: 10.4328/ACAM.22948

Authors

Affiliations

1Department of Internal Medicine, Faculty of Medicine, Sivas Cumhuriyet University, Sivas, Turkey

Corresponding Author

Abstract

Aim Our study aims to comparatively evaluate the accuracy and reliability of the responses provided by the artificial intelligence (AI) chatbots ChatGPT, DeepSeek, and Gemini to clinical questions about unintentional weight loss (UWL), a symptom with a multidisciplinary and broad differential diagnosis.
Materials and Methods A total of 129 clinical questions compiled from various health websites, textbooks, and guidelines were categorized under six main headings (definitions and concepts, symptomatology, differential diagnosis, diagnostic approach, treatment and management, and patient questions) and directed to three different AI chatbots. Each chatbot was asked to rate the difficulty of each question, and the relationship between these difficulty ratings and accuracy performance was analyzed. Each response was evaluated by three internal medicine specialists and scored 1–4 for accuracy.
Results ChatGPT and DeepSeek demonstrated similar performance with high accuracy rates, while Gemini performed at a significantly lower accuracy level. Significant differences were observed between the chatbots in five of the six question groups (p < 0.05). Most of these differences stemmed from Gemini’s poor performance. No significant difference was observed in the treatment and management question group (p = 0.124). No significant relationship was found between question difficulty level and chatbot accuracy rates (p > 0.05).
Discussion While ChatGPT and DeepSeek offered high accuracy and reliability, Gemini performed below these two chatbots. Our findings indicate that AI chatbots should not be used as standalone tools for diagnosis, treatment, and management in complex clinical decision-making processes. However, they can be considered an important complementary tool for rapid access to accurate information and for supporting clinical decisions.

Keywords

artificial intelligence, ChatGPT, DeepSeek, Gemini, unintended weight loss

Introduction

Unintentional weight loss (UWL) is a topic that sometimes challenges internal medicine specialists. The term describes situations where weight loss occurs without the patient’s conscious effort and is not an expected result of chronic disease or drug treatment [1]. It is defined as a loss of at least 5% of the patient’s usual weight over the past 6–12 months. Its prevalence ranges from 7% to 13%, reaching up to 20% in individuals over the age of 65 [2]. Many conditions are involved in the etiology of UWL; the most common causes are malignancies, endocrine disorders, non-malignant gastrointestinal diseases, and psychiatric disorders. Despite detailed diagnostic evaluations, a diagnosis cannot be made in one-quarter of these patients. Therefore, early recognition and detailed evaluation of UWL play a critical role in reducing morbidity and mortality [1, 3].
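As a worked illustration of this threshold (the figures below are hypothetical and not study data): weight loss (%) = (baseline weight − current weight) / baseline weight × 100. For example, a patient whose weight falls from 80 kg to 75 kg over eight months has lost (80 − 75) / 80 × 100 = 6.25% of baseline weight, which exceeds the 5% threshold and would therefore meet the definition of UWL.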
Recent developments in artificial intelligence (AI) have paved the way for AI-powered chatbots to be used as a source of medical information [4]. In fact, guidelines have even been created regarding their use in this context [5]. However, the information provided by AI chatbots may not always be accurate and reliable. Particularly for a symptom such as UWL, which requires a multidisciplinary and detailed assessment, inaccurate information may mislead the patient or physician and cause unnecessary anxiety. This can delay diagnosis, thereby increasing morbidity and mortality [6]. Although there are studies in the literature examining the performance of AI chatbots in answering medical questions, research focusing on symptom-based and particularly complex topics, such as UWL, is limited [7, 8].
This study aims to comparatively evaluate the responses provided by different AI chatbots to questions about unintentional weight loss in terms of accuracy, comprehensiveness, appropriateness, and clinical safety. The resulting data characterize the performance of AI chatbot systems used in medical information sharing and thereby provide an evidence-based foundation for the appropriate use of these technologies in clinical counseling.

Materials and Methods

Questions frequently asked by patients on health websites, on internal medicine, endocrinology, and gastroenterology websites, and on social media (Facebook, Instagram, and Twitter) were recorded. Health websites were selected as follows: when the subtopic to be evaluated was searched on Google, the frequently asked questions on the first patient-information sites that appeared were reviewed. Strong recommendations on every topic related to UWL in UpToDate and Harrison’s Principles of Internal Medicine were converted into questions. Repeated questions, grammatically incorrect questions, questions about personal health, and questions with unclear answers were excluded from the study. All questions included in the study were created by three experienced internal medicine specialists and were ultimately categorized as follows:
• Definitions and Concepts
• Symptomatology
• Differential Diagnosis
• Diagnostic Approach
• Treatment and Management
• Patient Questions
ChatGPT 4.0 (OpenAI, USA), DeepSeek-R1 (DeepSeek-AI, China), and Gemini 2.5 Flash (Google DeepMind, USA) were queried in internet browsers with cookies and history cleared. The responses provided by these AI chatbots were evaluated and scored by three internal medicine specialists. Each response was assigned a score from 1 to 4: responses that correctly included all the information an internist should provide to a patient were rated 1 (completely correct); correct but insufficient responses, 2; responses mixing correct and misleading information, 3; and completely incorrect responses, 4. Questions based on Harrison’s Principles of Internal Medicine and UpToDate strong recommendations were evaluated for their alignment with guideline information. Scoring discrepancies among the three reviewers were resolved by taking the arithmetic mean, and responses with divergent scores were jointly reviewed by the reviewers against the primary sources. To confirm reproducibility, the same questions were resubmitted from different IP addresses and the responses compared. Each chatbot was also asked to rate the difficulty level of each question, and based on these responses all questions were classified as easy, medium, or difficult.
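For illustration, the minimal sketch below (Python with pandas; the file and column names are hypothetical assumptions, not the study’s actual dataset) shows how the three reviewers’ 1–4 scores can be averaged per response and how the chatbot-reported difficulty labels can be mapped to an ordinal scale, in line with the workflow described above.

```python
import pandas as pd

# Hypothetical input: one row per question per chatbot, holding the three
# reviewers' 1-4 accuracy scores and the chatbot-reported difficulty label.
scores = pd.read_csv("chatbot_scores.csv")
# expected columns: question_id, chatbot, reviewer1, reviewer2, reviewer3, difficulty

# Resolve scoring discrepancies by taking the arithmetic mean of the three reviewers.
scores["accuracy_score"] = scores[["reviewer1", "reviewer2", "reviewer3"]].mean(axis=1)

# Map the chatbot-reported difficulty labels onto an ordinal scale (1 = easy ... 3 = difficult).
difficulty_map = {"easy": 1, "medium": 2, "difficult": 3}
scores["difficulty_level"] = scores["difficulty"].str.lower().map(difficulty_map)

# Percentage of questions answered completely correctly (mean score of 1) per chatbot.
fully_correct_pct = (
    scores.assign(fully_correct=scores["accuracy_score"].eq(1))
          .groupby("chatbot")["fully_correct"]
          .mean()
          .mul(100)
          .round(1)
)
print(fully_correct_pct)
```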
The data were analyzed using IBM SPSS Statistics 26.0 (IBM Corp., Armonk, NY, USA). The normality of distribution was assessed using the Kolmogorov–Smirnov test and skewness and kurtosis coefficients; histograms and Q–Q plots were also visually reviewed. Since the scores for all three models were ordinal and did not meet the assumption of normal distribution, non-parametric methods were preferred over parametric tests. Descriptive statistics were calculated as mean, standard deviation, minimum, and maximum values. The Friedman test was applied to compare the scores given by the three models to the same questions, and in groups where differences were detected, the Wilcoxon signed-rank test was used for post-hoc analysis. The Z-statistic and p-value were calculated for each test, and the multiple-comparison error was controlled with the Bonferroni correction, which set the significance threshold at p<0.017 for three pairwise comparisons. The Friedman test was also applied within the six question subgroups, with pairwise Wilcoxon comparisons where differences were detected, so that the performance of the AI chatbots in specific subject areas could be tested. In addition, the Spearman rank correlation test was used to evaluate the relationship between the chatbots’ scores and the difficulty levels of the questions: a positive correlation indicates that the score, and therefore the error rate, increases with difficulty, whereas a negative correlation indicates that the score decreases, i.e., accuracy increases, as difficulty rises. The confidence level was set at 95%, and the significance level at p<0.05.
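As a minimal sketch of this analysis pipeline, assuming the averaged scores are arranged in a wide table with one row per question (the file and column names below are hypothetical), the Friedman test, Bonferroni-corrected Wilcoxon post-hoc comparisons, and Spearman correlations could be reproduced in Python with SciPy as follows; the study itself used IBM SPSS 26.0.

```python
import pandas as pd
from scipy import stats

# Hypothetical wide table: one row per question, one mean-accuracy-score column
# per chatbot, plus each chatbot's ordinal difficulty rating for that question.
df = pd.read_csv("scores_wide.csv")
# expected columns: chatgpt, deepseek, gemini, chatgpt_diff, deepseek_diff, gemini_diff

# Friedman test: do the three related samples (same questions) differ?
friedman_stat, friedman_p = stats.friedmanchisquare(
    df["chatgpt"], df["deepseek"], df["gemini"]
)
print(f"Friedman chi2={friedman_stat:.3f}, p={friedman_p:.4f}")

# Post-hoc Wilcoxon signed-rank tests with Bonferroni correction:
# three pairwise comparisons, so the adjusted threshold is 0.05 / 3 ≈ 0.017.
pairs = [("chatgpt", "deepseek"), ("chatgpt", "gemini"), ("deepseek", "gemini")]
alpha_adjusted = 0.05 / len(pairs)
for a, b in pairs:
    w_stat, w_p = stats.wilcoxon(df[a], df[b])
    print(f"{a} vs {b}: W={w_stat:.1f}, p={w_p:.4f}, significant={w_p < alpha_adjusted}")

# Spearman rank correlation between accuracy score and difficulty level:
# a positive rho means scores (i.e., errors) rise as questions get harder.
for model in ["chatgpt", "deepseek", "gemini"]:
    rho, p = stats.spearmanr(df[model], df[f"{model}_diff"])
    print(f"{model}: rho={rho:.3f}, p={p:.3f}")
```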
Ethical Approval
Since the ChatGPT, DeepSeek, and Gemini chatbots, as well as Harrison’s Principles of Internal Medicine, UpToDate, and the guideline information used, are publicly available, ethical approval was not required for this study.

Results

After applying the exclusion criteria, a total of 129 questions remained for evaluation. ChatGPT provided fully correct answers to 87.6% of the questions, DeepSeek to 85.3%, and Gemini to 41.1%; despite this low rate, Gemini answered 55.8% of the questions partially correctly. ChatGPT had the highest accuracy, with the lowest mean score of 1.12; DeepSeek showed similar accuracy with a mean score of 1.19; and Gemini had the lowest accuracy, with a mean of 1.62. The mean difficulty ratings were similar: 2.27±0.72 for ChatGPT, 2.40±0.59 for DeepSeek, and 2.44±0.59 for Gemini, with most questions rated as medium difficulty. Detailed analysis results are provided in Tables 1 and 2.
There was a statistically significant difference in the accuracy of the responses provided by the AI chatbots (p < 0.001). In the subsequent post-hoc analysis, no significant difference was found between ChatGPT and DeepSeek (p = 0.267). Gemini performed worse than the other chatbots (p < 0.001).
The questions were divided into six groups. The differential diagnosis group contained the most questions (40), while the symptomatology and the treatment and management groups contained the fewest (14). When the responses given by the AI chatbots to the question subgroups were evaluated, significant differences were detected in all groups except treatment and management. These differences generally stemmed from Gemini scoring higher, that is, making more errors. Detailed analysis results are provided in Tables 1, 2, and 3.
The relationship between the answers provided by the AI chatbots and question difficulty was examined, and no significant correlation was found. Spearman correlation coefficients were ρ=−0.124 (p = 0.592) for ChatGPT, ρ=−0.141 (p = 0.541) for DeepSeek, and ρ=0.216 (p = 0.347) for Gemini. This indicates that accuracy did not change significantly as questions became more difficult; the chatbots’ performance was independent of question difficulty.

Discussion

Our study comparatively evaluated the responses provided by three different AI chatbots to clinical questions about UWL in terms of accuracy, comprehensiveness, and reliability. According to our results, ChatGPT and DeepSeek performed at a similar level in terms of accuracy, while Gemini had significantly lower accuracy. In addition, the performance of these AI chatbots was evaluated separately in the question groups we created, and significant differences were observed in the answers given to most question groups. Again, this difference was largely attributed to Gemini’s low accuracy rate. However, no significant difference was found in the treatment and management question group. This suggests that AI chatbots may have a limited but guiding level of shared knowledge base in clinical decision-making processes. No significant relationship was found between the difficulty level of the questions and the performance of the AI chatbots.
UWL is a symptom that clinicians frequently encounter. Its differential diagnosis includes many conditions such as malignancies, endocrine disorders, and psychiatric illnesses, making it difficult to detect serious diseases associated with mortality and morbidity at an early stage [9]. The current lack of valid guidelines and the failure to identify the underlying cause for an extended period of time complicate the diagnostic process [10].
With the widespread adoption of AI chatbots in the healthcare sector in recent years, they have begun to be used in patient education, pre-triage, and clinical decision-making processes, and studies on this topic have increased in the literature. In a study conducted by Ayers et al., ChatGPT provided medical responses of similar quality to those of physicians [11]. In a study on cardiovascular pharmacology, ChatGPT also provided largely accurate answers [12]. In a study using real patient data related to internal medicine, ChatGPT performed well, and it was emphasized that it could be used as a supportive diagnostic tool [13]. In two recent studies examining ChatGPT’s responses to questions about urological diseases, its answers were found to be acceptable and highly accurate [14, 15]. In contrast, other studies have shown that the medical information provided by ChatGPT can be of limited quality [6, 8, 16]; Goodman et al. reported that approximately one-third of ChatGPT’s responses to clinician questions contained incomplete, misleading, or off-target information [8]. Our study is generally consistent with the literature and shows that ChatGPT can often provide accurate and clinically meaningful information and can be used as a supportive tool in the clinic. Our evaluation of a complex symptom such as UWL, which requires a multidisciplinary approach, contributes significantly to the literature by examining the performance of AI chatbots across a wide clinical spectrum. However, given the slight decrease in performance in some question subgroups, ChatGPT should be used as a supportive tool for physicians rather than a direct diagnostic tool.
Another AI chatbot evaluated in our study is DeepSeek, which demonstrated performance similar to ChatGPT. Temsah et al. emphasized that DeepSeek performs well in the medical field and has potential for healthcare innovation [17]. Similarly, another study highlighted DeepSeek’s high accuracy rate and found it sufficient for clinically consistent reasoning, but recommended that it be used under clinician supervision because errors can appear in its very long responses [18]. Comparative studies of DeepSeek and ChatGPT are available in the literature and show mixed results, although in general the performance of both chatbots was high, as in our study [19, 20]. In a study by Gurbuz et al., ChatGPT performed better than DeepSeek overall, although DeepSeek stood out in some subgroups [20]. In another study, DeepSeek performed better, particularly in terms of referencing [19]. Consistent with the literature, DeepSeek performed similarly to ChatGPT in our study, with its lowest accuracy rate observed in the treatment and management section. This indicates that DeepSeek’s knowledge of current guideline-based treatment is more limited than in the other question groups. Based on these results, while DeepSeek is strong in areas such as information analysis and classification, it lags behind at advanced clinical decision-making stages. Therefore, it may be suitable for guiding clinicians, but it should not be used alone in areas such as treatment and management.
In our study, Gemini’s performance was significantly lower than that of the other two AI chatbots. In the literature, a study on cardiovascular pharmacology questions showed that Gemini lagged considerably behind ChatGPT, with responses that often remained superficial and failed to reflect clinical nuances [12]. Similarly, in a study of responses to radiological patient questions, Gemini’s answers were acceptable but lagged behind the other two AI chatbots and were found to be inadequate for detailed clinical explanations [21]. Another comparative study also found Gemini to be behind ChatGPT, particularly showing reduced performance with complex medical content [22]. Considering that our study focuses on a complex, multidisciplinary topic like UWL that requires a high level of clinical reasoning, it can be concluded, in line with the literature, that Gemini is insufficient for answering complex questions. However, the absence of meaningful differences in the treatment and management group may indicate that it shares a similar infrastructure with the other AI chatbots in terms of basic clinical protocol knowledge. Additionally, our study used the Gemini 2.5 Flash model; studies in the literature reporting results for ‘Gemini’ may have used different versions, which can affect performance. Therefore, our findings apply specifically to Gemini 2.5 Flash, and future studies are recommended to replicate these analyses using other Gemini versions.
Our study was conducted in Turkish. The literature indicates that the performance of AI chatbots can vary across languages. In a two-language comparison, English responses demonstrated higher accuracy than Turkish responses [23]. Other studies have likewise reported higher accuracy for English compared with Japanese, Russian, and Kazakh [24, 25]. Therefore, our findings should be generalized beyond Turkish with caution.
One of the most significant strengths of our study is that it is one of the few comparing the performance of three different AI chatbots in a clinical setting such as UWL, which requires a multifaceted, broad perspective and a high level of reasoning. While previous studies have generally been limited to a single disease and field, our study evaluated AI chatbots across different clinical scenarios by creating six question groups covering both treatment and patient communication. Furthermore, by using questions compiled from real-world patients and physicians, a high level of alignment with clinical practice was achieved. Another strength is that the relationship between response accuracy and question difficulty was examined.

Limitations

The first limitation is that the chatbots were evaluated within a specific time frame and based on single versions, so their performance may change with future updates. Furthermore, since the questions and answers were in Turkish, linguistic performance differences may lead to different results, particularly in widely spoken languages such as English.

Conclusion

Our study comparatively evaluated the accuracy, reliability, and comprehensiveness of the responses provided by three different AI chatbots to questions about a multidimensional symptom, UWL. According to our findings, ChatGPT and DeepSeek generally performed similarly and at a high level, while Gemini demonstrated lower performance. While DeepSeek’s performance declined, particularly in the area of treatment and management, ChatGPT provided more consistent results across most question groups. These results indicate that AI chatbots should be considered complementary and supportive information sources for physicians rather than direct clinical tools.

References

  1. Perera LAM, Chopra A, Shaw AL. Approach to patients with unintentional weight loss. Med Clin North Am. 2021;105(1):175-86. doi:10.1016/j.mcna.2020.08.019.
  2. McMinn J, Steel C, Bowman A. Investigation and management of unintentional weight loss in older adults. BMJ. 2011;342(7800):754-9. doi:10.1136/bmj.d1732.
  3. Wong CJ. Involuntary weight loss. Med Clin North Am. 2014;98(3):625-43. doi:10.1016/j.mcna.2014.01.012.
  4. Miner AS, Laranjo L, Kocaballi AB. Chatbots in the fight against the COVID-19 pandemic. NPJ Digit Med. 2020;3:65. doi:10.1038/s41746-020-0280-0.
  5. Flanagin A, Bibbins-Domingo K, Berkwits M, Christiansen SL. Nonhuman “authors” and implications for the integrity of scientific publication and medical knowledge. JAMA. 2023;329(8):637-9. doi:10.1001/jama.2023.1344.
  6. Çıracıoğlu AM, Dal Erdoğan S. Evaluation of the reliability, usefulness, quality and readability of ChatGPT’s responses on Scoliosis. Eur J Orthop Surg Traumatol. 2025;35(1):123. doi:10.1007/s00590-025-04198-4.
  7. Shiferaw MW, Zheng T, Winter A, Mike LA, Chan LN. Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions. BMC Med Inform Decis Mak. 2024;24(1):1-8. doi:10.1186/s12911-024-02824-5.
  8. Goodman RS, Patrinely JR, Stone CA, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open. 2023;6(10):e2336483. doi:10.1001/jamanetworkopen.2023.36483.
  9. Rao S, Kikano EG, Smith DA, Guler E, Tirumani SH, Ramaiya NH. Unintentional weight loss: what radiologists need to know and what clinicians want to know. Abdom Radiol (NY). 2021;46(5):2236-50. doi:10.1007/s00261-020-02908-6.
  10. Gaddey HL, Holder KK. Unintentional weight loss in older adults. Am Fam Physician. 2021;104(1):34-40.
  11. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589-96. doi:10.1001/jamainternmed.2023.1838.
  12. Salman IM, Ameer OZ, Khanfar MA, Hsieh YH. Artificial intelligence in healthcare education: evaluating the accuracy of ChatGPT, Copilot, and Google Gemini in cardiovascular pharmacology. Front Med. 2025;12:1495378. doi:10.3389/fmed.2025.1495378.
  13. Hoppe JM, Auer MK, Strüven A, Massberg S, Stremmel C. ChatGPT with GPT-4 outperforms emergency department physicians in diagnostic accuracy: Retrospective analysis. J Med Internet Res. 2024;26:e56110. doi:10.2196/56110.
  14. Ergin İE, Sancı A. Can ChatGPT help patients understand their andrological diseases? Rev Int Androl. 2024;22(2):14-20. doi:10.22514/j.androl.2024.010.
  15. Öztürk A, Ergin İE. Evaluation of ChatGPT’s performance in answering questions about female lower urinary tract symptoms. Andrology Bulletin. 2024;26(3):173-8. doi:10.24898/tandro.2024.53486.
  16. Walker HL, Ghani S, Kuemmerli C, et al. Reliability of medical information provided by ChatGPT: Assessment against clinical guidelines and patient information quality instrument. J Med Internet Res. 2023;25:e47479. doi:10.2196/47479.
  17. Temsah A, Alhasan K, Altamimi I, et al. DeepSeek in healthcare: Revealing opportunities and steering challenges of a new open-source artificial intelligence frontier. Cureus. 2025;17(2):e79221. doi:10.7759/cureus.79221.
  18. Moëll B, Sand Aronsson F, Akbar S. Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1. Front Artif Intell. 2025;8:1616145. doi:10.3389/frai.2025.1616145.
  19. Kaygisiz ÖF, Teke MT. Can DeepSeek and ChatGPT be used in the diagnosis of oral pathologies? BMC Oral Health. 2025;25(1):1-8. doi:10.1186/s12903-025-06034-x.
  20. Gurbuz S, Bahar H, Yavuz U, Keskin A, Karslioglu B, Solak Y. Comparative efficacy of ChatGPT and DeepSeek in addressing patient queries on gonarthrosis and total knee arthroplasty. Arthroplast Today. 2025;33:101730. doi:10.1016/j.artd.2025.101730.
  21. Marey A, Saad AM, Tanas Y, et al. Evaluating the accuracy and reliability of AI chatbots in patient education on cardiovascular imaging: a comparative study of ChatGPT, Gemini, and Copilot. Egypt J Radiol Nucl Med. 2025;56(1):1-10. doi:10.1186/s43055-025-01452-x.
  22. Al-Thani SN, Anjum S, Bhutta ZA, et al. Comparative performance of ChatGPT, Gemini, and final-year emergency medicine clerkship students in answering multiple-choice questions: implications for the use of AI in medical education. Int J Emerg Med. 2025;18(1):1-8. doi:10.1186/s12245-025-00949-6.
  23. İnan B, Karadaş Ö, Odabaşı Z. Evaluation of ChatGPT-4o’s responses to questions about myasthenia gravis in English and Turkish. Med J Bakirkoy. 2025;21(3):310-5. doi:10.4274/BMJ.galenos.2025.2025.5-9.
  24. Ando K, Sato M, Wakatsuki S, et al. A comparative study of English and Japanese ChatGPT responses to anaesthesia-related medical questions. BJA Open. 2024;10:100296. doi:10.1016/j.bjao.2024.100296.
  25. Adilmetova G, Nassyrov R, Meyerbekova A, Karabay A, Varol HA, Chan MY. Evaluating ChatGPT’s multilingual performance in clinical nutrition advice using synthetic medical text: Insights from Central Asia. J Nutr. 2025;155(3):729-35. doi:10.1016/j.tjnut.2024.12.018.

Declarations

Scientific Responsibility Statement

The authors declare that they are responsible for the article’s scientific content, including study design, data collection, analysis and interpretation, writing, preparation and scientific review of the contents, and approval of the final version of the article.

Animal and Human Rights Statement

All procedures performed in this study were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Funding

None

Conflict of Interest

The authors declare that there is no conflict of interest.

Acknowledgment

AI Usage
The authors used the AI-based translation tool DeepL solely for English translation and language editing. All scientific content, data interpretation, and conclusions were produced by the authors.

Data Availability

The datasets used and/or analyzed during the current study are not publicly available due to patient privacy reasons but are available from the corresponding author on reasonable request.

Additional Information

Publisher’s Note
Bayrakol MP remains neutral with regard to jurisdictional and institutional claims.

Rights and Permissions

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). To view a copy of the license, visit https://creativecommons.org/licenses/by-nc/4.0/

About This Article

How to Cite This Article

Zekeriya Keskin, Muhammed Faruk Aşkın, Onur Büyüktekeli. Evaluation of artificial intelligence chatbots in answering questions about unintentional weight loss: a comparative study. Ann Clin Anal Med 2025; DOI: 10.4328/ACAM.22948

Publication History

Received:
October 14, 2025
Accepted:
November 18, 2025
Published Online:
December 2, 2025