Comparative evaluation of three artificial intelligence chatbots in providing information about vaccines
AI chatbots and vaccine information
Authors
Abstract
Aim: Vaccine hesitancy remains a major global public health challenge, largely driven by misinformation and declining confidence in vaccine safety and efficacy. With the increasing use of artificial intelligence (AI)-based conversational agents as sources of health information, concerns have emerged regarding the accuracy, completeness, and consistency of vaccine-related information provided by these tools. This study aimed to comparatively evaluate the quality and accuracy of vaccine-related information generated by three widely used AI chatbots: ChatGPT, Gemini, and DeepSeek.
Methods: A predefined set of standardized vaccine-related questions addressing vaccine safety, efficacy, side effects, and common misconceptions was posed to each chatbot. Responses were independently evaluated by multiple reviewers using predefined accuracy categories. Inter-model agreement was assessed using Cohen’s kappa coefficient to determine consistency between chatbot responses.
Results: AI chatbots generally provided clear and informative responses to vaccine-related questions. However, notable variations were observed in the accuracy, completeness, and depth of information across models. While some responses were fully aligned with established scientific evidence, others were partially incomplete or oversimplified. Agreement levels between chatbots ranged from low to moderate, indicating variability in how vaccine-related information was generated.
Conclusion: AI-based chatbots show potential as supportive tools for vaccine communication; however, variability in response quality and inter-model inconsistency raise concerns regarding their reliability as standalone information sources. Expert oversight and alignment with evidence-based public health guidance are essential for their responsible use.
Keywords
Introduction
Vaccination is widely recognized as one of the most effective public health interventions for preventing infectious diseases and reducing global morbidity and mortality. Despite its well-established effectiveness, vaccine hesitancy has emerged as a growing public health concern, largely driven by misinformation, distrust, and declining confidence in vaccine safety and efficacy. The World Health Organization has identified vaccine hesitancy as one of the major global health threats and has emphasized the importance of providing accurate, transparent, and evidence-based information to sustain public confidence in immunization programs.1
The rapid expansion of digital communication platforms has fundamentally transformed how individuals access health-related information. In this evolving information environment, artificial intelligence (AI)-based conversational agents, commonly referred to as chatbots, have become increasingly prominent sources of health guidance. Large language models such as ChatGPT are now frequently consulted by the public for vaccine-related questions, including concerns about safety, side effects, and common misconceptions. However, important concerns remain regarding whether AI-generated responses are consistently accurate, sufficiently comprehensive, and aligned with established scientific evidence.2
Recent studies have evaluated the reliability of ChatGPT as an information source in the context of vaccine and statin hesitancy. While some findings suggest that ChatGPT can provide generally accurate and understandable explanations, other investigations indicate that responses may at times be incomplete, oversimplified, or lacking contextual nuance.3 Such limitations may inadvertently contribute to misunderstanding or reinforce existing uncertainties. These findings raise important questions about the appropriateness of relying solely on AI-based chatbots for vaccine-related information.
Beyond evaluations of individual models, systematic reviews have examined the broader application of AI-based chatbots in healthcare settings. These reviews suggest that chatbots may enhance access to health information and improve patient engagement. Nevertheless, they also highlight substantial variability in response quality and a lack of standardized evaluation frameworks. This variability is particularly concerning in sensitive domains such as vaccination, where inaccurate or inconsistent information may adversely influence public attitudes and health behaviors.3
Furthermore, emerging evidence indicates that AI-driven communication tools may influence vaccine-related knowledge, attitudes, and behavioral intentions. A recent systematic review and meta-analysis demonstrated that chatbot-based interventions can positively affect vaccination uptake and acceptance. However, the effectiveness of such interventions depends heavily on the accuracy, clarity, and consistency of the information provided, underscoring the need for systematic evaluation of chatbot-generated content.4
Despite the expanding literature on AI chatbots in healthcare, direct comparative studies evaluating multiple large language models simultaneously with respect to vaccine-related information quality and response consistency remain limited. In particular, robust head-to-head comparisons of ChatGPT, Gemini, and DeepSeek using standardized vaccine-related questions and structured multi-evaluator assessment frameworks are scarce. Therefore, the aim of this study is to comparatively evaluate the quality and accuracy of vaccine-related information generated by three widely used AI chatbots—ChatGPT, Gemini, and DeepSeek. Responses were assessed by multiple independent evaluators using predefined accuracy categories, and inter-model agreement was analyzed using Cohen’s kappa coefficient. By identifying the relative strengths and limitations of these AI systems, this study seeks to contribute to the responsible, evidence-based integration of AI chatbots into vaccine communication and public health practice.
Materials and Methods
Study Design
This study was designed as a cross-sectional, observational analysis to evaluate the accuracy, completeness, and consistency of responses provided by three artificial intelligence-based chatbots (ChatGPT, Google Gemini, and DeepSeek) to vaccine-related questions. The study focused on assessing the quality of information provided by these chatbots in a public health domain where scientific accuracy is crucial.
Data Collection
To ensure consistency, each of the 20 standardized vaccine-related questions was asked separately to ChatGPT, Google Gemini, and DeepSeek using identical wording. All responses were generated on February 28, 2026, between 21:00 and 22:00, using the latest available versions of ChatGPT (OpenAI), Gemini (Google), and DeepSeek. To minimize potential variability related to model updates, all chatbot responses were obtained within the same predetermined time frame. No follow-up questions or clarifications were made, and only the first response generated by each chatbot was included in the analysis.
Evaluation of Chatbot Responses
The accuracy and completeness of the chatbots' responses were independently evaluated by three experts: a professor of pediatrics, a pediatric infectious disease specialist, and a public health expert experienced in vaccination practices. To minimize potential bias, a blinded evaluation approach was used, in which the identities of the chatbots were concealed from the evaluators. Each response was evaluated according to current international vaccination guidelines and classified into four predefined categories based on scientific accuracy and clinical relevance:
(1) completely correct and comprehensive,
(2) partially correct but incomplete,
(3) misleading or partially incorrect,
(4) completely incorrect or irrelevant.
This classification framework was used to ensure consistency and reproducibility across evaluators (Table 1).
Resolving Discrepancies
In cases of disagreement among evaluators, responses were re-examined through consensus-based discussions to minimize subjective bias and ensure internal consistency in the evaluation process. Final classifications were determined after agreement was reached among evaluators.
Ethical Approval
This study did not involve human participants, patient data, or biological materials. All analyzed data consisted of publicly available responses generated by AI-based chatbots. Therefore, ethical committee approval was not required. The study was conducted in accordance with general ethical principles for the investigation and responsible evaluation of AI applications.
Statistical Analysis
Cohen's kappa coefficient was calculated to assess pairwise response consistency among the three chatbots. The distribution of responses was summarized using descriptive statistics. Responses provided by each chatbot were reported as numbers and percentages; the proportion of “completely correct/adequate” responses was considered the primary performance indicator. All statistical analyses were performed using IBM SPSS Statistics for Windows, Version 26.0 (IBM Corp., Armonk, NY, USA).
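For reference, Cohen’s kappa corrects the observed proportion of agreement between two sets of categorical ratings for the agreement expected by chance (a standard definition, not specific to this study):

κ = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of questions assigned the same category by both chatbots and Pe is the agreement expected by chance given each chatbot’s marginal category frequencies. A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate agreement below chance.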
Reporting Guidelines
This cross-sectional comparative study was conducted and reported in accordance with the STROBE guidelines.
Results
The detailed item-based scoring of responses provided by ChatGPT, Gemini, and DeepSeek across the 20 standardized questions is presented in Supplementary Table 1. Minor discrepancies between evaluators were observed for certain questions, particularly within the partially correct/incomplete and misleading/partially incorrect categories for Gemini and DeepSeek. In contrast, ChatGPT responses were classified consistently across evaluators for most items.
The distribution of chatbot responses according to each evaluator is summarized in Supplementary Table 2. According to the first evaluator, 80% of ChatGPT responses were classified as completely correct/adequate and 20% as partially correct/incomplete, whereas Gemini and DeepSeek each achieved a fully correct rate of 60%. The second evaluator rated 85% of ChatGPT responses as completely correct/adequate and 15% as partially correct/incomplete, with Gemini and DeepSeek each demonstrating a 70% fully correct rate. The third evaluator classified all ChatGPT responses (100%) as completely correct/adequate. For Gemini, 35% were completely correct/adequate, 55% partially correct/incomplete, and 10% misleading/partially incorrect. For DeepSeek, 10% were completely correct/adequate, 80% partially correct/incomplete, and 10% misleading/partially incorrect. No responses were categorized as completely incorrect/irrelevant by any evaluator.
The combined overall distribution of scores is provided in Table 1. When all evaluators’ scores were aggregated, 88.3% of ChatGPT responses were classified as completely correct/adequate and 11.7% as partially correct/incomplete. For Gemini, 55% were completely correct/adequate, 38.3% partially correct/incomplete, and 6.7% misleading/partially incorrect. DeepSeek responses were classified as 46.7% completely correct/adequate, 50% partially correct/incomplete, and 3.3% misleading/partially incorrect. No chatbot produced responses categorized as completely incorrect/irrelevant.

Inter-model agreement results based on Cohen’s kappa coefficient are presented in Table 2. The agreement between ChatGPT and Gemini was κ = 0.00, and the agreement between ChatGPT and DeepSeek was κ = 0.04, both indicating no meaningful agreement. In contrast, the agreement between Gemini and DeepSeek was κ = 0.45, reflecting moderate agreement.
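For readers who wish to reproduce this type of analysis, pairwise kappa statistics can be computed directly from the categorical ratings. The following minimal Python sketch uses scikit-learn's cohen_kappa_score with hypothetical rating vectors; these are illustrative placeholders, not the study data, and the study's own analysis was performed in SPSS:

```python
# Minimal sketch of pairwise Cohen's kappa between chatbot response categories.
# The rating vectors are hypothetical placeholders, not the study data: each
# entry is one question's category on the 4-point scale
# (1 = completely correct/adequate, 2 = partially correct/incomplete,
#  3 = misleading/partially incorrect, 4 = completely incorrect/irrelevant).
from sklearn.metrics import cohen_kappa_score

chatgpt = [1, 1, 1, 2, 1, 1, 2, 1, 1, 1]   # hypothetical ratings
gemini = [1, 2, 2, 2, 1, 3, 2, 1, 2, 2]    # hypothetical ratings
deepseek = [2, 2, 2, 2, 1, 3, 2, 2, 2, 2]  # hypothetical ratings

pairs = {
    "ChatGPT vs Gemini": (chatgpt, gemini),
    "ChatGPT vs DeepSeek": (chatgpt, deepseek),
    "Gemini vs DeepSeek": (gemini, deepseek),
}
for name, (a, b) in pairs.items():
    # cohen_kappa_score treats the two vectors as paired categorical labels
    print(f"{name}: kappa = {cohen_kappa_score(a, b):.2f}")
```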
Discussion
The emergence of large language models (LLMs) such as ChatGPT, Gemini, and DeepSeek has introduced significant opportunities, as well as important concerns, regarding the presentation of health-related information. In the present study, the performance of these three chatbots in providing vaccine-related information was comparatively evaluated. Our findings demonstrate notable differences among the models in terms of response accuracy, adequacy, and consistency, suggesting that LLMs should not be regarded as interchangeable tools in health communication.6,7
Specifically, ChatGPT generated a higher proportion of responses categorized as “completely correct and sufficient” compared with Gemini and DeepSeek. This observation aligns with previous reports indicating that ChatGPT may provide more structured, coherent, and context-sensitive responses in medical and public health domains.8,9
Previous evaluations in the context of vaccination have indicated that ChatGPT can generate scientifically grounded, evidence-based explanations addressing common misconceptions. However, several studies have also noted that responses may at times be overly generalized or lack contextual nuance.10,11
In contrast, the higher proportion of “partially correct/incomplete” responses observed in Gemini and DeepSeek suggests greater variability in how these models prioritize and frame information, particularly in sensitive and continuously evolving topics such as vaccination. Comparative analyses examining the alignment of LLM responses with World Health Organization and national vaccination guidelines have similarly reported meaningful differences among models, not only in terms of factual accuracy but also regarding emphasis, framing, and presentation style.4,12
These findings indicate that different artificial intelligence models may produce clinically relevant differences in output, even when trained on broadly similar datasets. In the present study, the assessment of inter-model agreement using Cohen’s kappa coefficient represents an important methodological contribution. The absence of meaningful agreement between ChatGPT and Gemini, as well as between ChatGPT and DeepSeek, demonstrates that these models generate categorically distinct responses to identical vaccine-related questions. In contrast, the moderate level of agreement observed between Gemini and DeepSeek suggests potential similarities in response-generation strategies or underlying algorithmic approaches.13
These findings indicate that AI-based chatbots should not be regarded as interchangeable or “one-size-fits-all” tools, and that model selection may be a decisive factor in health communication strategies. From a vaccine communication and public health perspective, the present results carry important clinical and practical implications. Previous studies have demonstrated that chatbot-based interventions can positively influence vaccination intent and health literacy; however, the magnitude and sustainability of this effect depend largely on the accuracy, consistency, and clarity of the information delivered.14,15
In this context, prioritizing models that generate more reliable and evidence-aligned responses is essential. Furthermore, chatbot-generated outputs should ideally undergo expert review, and these systems should be positioned as supportive decision-aid tools rather than replacements for healthcare professionals.
Limitations
This study has several limitations. First, the evaluation was conducted using a limited set of standardized questions, which may not fully capture the diversity and complexity of vaccine-related inquiries encountered in real-world settings. Second, although inter-model agreement was statistically analyzed, the categorization of responses was ultimately based on expert assessment and therefore inherently involves a degree of subjective interpretation.16
Finally, because large language models are continuously updated and refined, the findings of this study reflect the performance of the evaluated models during a specific time frame. Consequently, future updates or algorithmic modifications may influence response quality and consistency.17
Conclusion
In conclusion, this study demonstrates that ChatGPT, Gemini, and DeepSeek differ substantially in their performance when generating vaccine-related information. These findings indicate that AI-based chatbots should not be considered interchangeable tools in health communication.
Although such systems hold considerable potential for addressing vaccine hesitancy and disseminating public health information, their implementation must be guided by principles of accuracy, transparency, and expert oversight to ensure safe and responsible use. Future multi-center and longitudinal investigations are warranted to further clarify the evolving role, reliability, and limitations of large language models in health communication.18
Declarations
Ethics Declarations
This study did not involve human participants, patient data, or biological materials. All analyzed data consisted of publicly available responses generated by artificial intelligence–based chatbots.
Therefore, ethical committee approval was not required in accordance with institutional and international research guidelines.
Animal and Human Rights Statement
This study did not involve human participants or animal subjects.
Informed Consent
Informed consent was not required, as the study did not include human participants or identifiable personal data.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Conflict of Interest
The authors declare that there is no conflict of interest.
Funding
None.
Author Contributions (CRediT Taxonomy)
Conceptualization: B.Ö.Ö.
Methodology: B.Ö.Ö., N.Ö.E.
Formal analysis: N.Ö.E.
Investigation: B.Ö.Ö.
Data curation: B.Ö.Ö.
Writing – original draft: B.Ö.Ö.
Writing – review & editing: N.Ö.E.
Supervision: N.Ö.E.
Scientific Responsibility Statement
The authors declare that they are responsible for the article’s scientific content, including study design, data collection, analysis and interpretation, writing, preparation and scientific review of the contents, and approval of the final version of the article.
Abbreviations
AI: Artificial intelligence
LLM: Large language model
STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
κ: Cohen’s kappa coefficient
References
1. Fiore M, Bianconi A, Acuti Martellucci C, et al. Vaccination hesitancy: agreement between WHO and ChatGPT-4.0 or Gemini Advanced. Ann Ig. 2025;37(3):390-396.
2. Deiana G, Dettori M, Arghittu A, et al. Artificial intelligence and public health: evaluating ChatGPT responses to vaccination myths and misconceptions. Vaccines (Basel). 2023;11(7):1217. doi:10.3390/vaccines11071217
3. Torun C, Sarmis A, Oguz A. Is ChatGPT an accurate and reliable source of information for patients with vaccine and statin hesitancy? Medeni Med J. 2024;39(1):1-7. doi:10.4274/mmj.galenos.2024.03154
4. Chan PS, Fang Y, Cheung DH, et al. Effectiveness of chatbots in increasing uptake, intention, and attitudes related to any type of vaccination: a systematic review and meta-analysis. Appl Psychol Health Well Being. 2024;16(4):2567-2597. doi:10.1111/aphw.12564
5. American Academy of Pediatrics Committee on Infectious Diseases. Red Book: 2024-2027 Report of the Committee on Infectious Diseases. 33rd ed. American Academy of Pediatrics; 2024. doi:10.1542/9781610027373
6. De Angelis L, Baglivo F, Arzilli G, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023;11:1166120. doi:10.3389/fpubh.2023.1166120
7. Iqbal U, Tanweer A, Rahmanti AR, et al. Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis. J Biomed Sci. 2025;32(1):45. doi:10.1186/s12929-025-01131-z
8. Joshi S, Ha E, Amaya A, et al. Ensuring accuracy and equity in vaccination information from ChatGPT and CDC: mixed-methods cross-language evaluation. JMIR Form Res. 2024;8:e60939. doi:10.2196/60939
9. Shapiro GK, Tatar O, Dube E, et al. The vaccine hesitancy scale: psychometric properties and validation. Vaccine. 2018;36(5):660-667. doi:10.1016/j.vaccine.2017.12.043
10. Passanante A, Pertwee E, Lin L, et al. Conversational AI and vaccine communication: systematic review of the evidence. J Med Internet Res. 2023;25:e42758. doi:10.2196/42758
11. Hong YJ, Piao M, Kim J, et al. Development and evaluation of a child vaccination chatbot real-time consultation messenger service during the COVID-19 pandemic. Appl Sci (Basel). 2021;11(24):12142. doi:10.3390/app112412142
12. Cosma C, Radi A, Cattano R, et al. Exploring chatbot contributions to enhancing vaccine literacy and uptake: a scoping review of the literature. Vaccine. 2025;44:126559. doi:10.1016/j.vaccine.2024.126559
13. Hou Z, Wu Z, Qu Z, et al. A vaccine chatbot intervention for parents to improve HPV vaccination uptake among middle school girls: a cluster randomized trial. Nat Med. 2025;31(6):1855-1862. doi:10.1038/s41591-025-03618-6
14. Issa AN, Wodi AP, Moser CA, Cineas S. Advisory Committee on Immunization Practices recommended immunization schedule for children and adolescents aged 18 years or younger—United States, 2025. MMWR Morb Mortal Wkly Rep. 2025;74(2):26-29. doi:10.15585/mmwr.mm7402a2
15. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb). 2012;22(3):276-282. doi:10.11613/bm.2012.031
16. Koh MCY, Ngiam JN, Salada BMA, et al. Can ChatGPT counter vaccine hesitancy? An evaluation of ChatGPT's responses to simulated queries from the general public. Healthcare (Basel). 2025;13(11):1269. doi:10.3390/healthcare13111269
17. Kim S, Kim K, Jo CW. Accuracy of a large language model in distinguishing anti- and pro-vaccination messages on social media: the case of human papillomavirus vaccination. Prev Med Rep. 2024;42:102723. doi:10.1016/j.pmedr.2024.102723
18. World Health Organization. European Immunization Agenda 2030. World Health Organization; 2021.
Additional Information
Publisher’s Note
Bayrakol MP remains neutral with regard to jurisdictional and institutional claims.
About This Article
- Received: February 18, 2026
- Accepted: April 24, 2026
- Published Online: May 1, 2026
