Evaluation of ChatGPT-generated responses to common patient questions about sacroiliac joint dysfunction: a quality assessment study – Annals of Clinical and Analytical Medicine

Authors

Hatice Zehra Ferhatlar¹, Mert Züre¹

Affiliations

¹Physical Therapy and Rehabilitation, Kanuni Sultan Süleyman Training and Research Hospital, İstanbul, Türkiye.

Corresponding Author

Hatice Zehra Ferhatlar

haticezehraferhatlar@gmail.com

+90 5367387869

Abstract

Aim: Sacroiliac joint dysfunction is a poorly understood condition that can cause chronic low back pain. Therefore, patients frequently turn to online resources, including AI chatbots, for information. The aim of this study is to evaluate the quality, accuracy, and appropriateness of medical information generated by artificial intelligence.
Methods: In this cross-sectional study, the top 50 most searched queries were identified from Google Trends. Twelve irrelevant and 11 duplicate queries were excluded, and the remaining 27 queries were submitted to ChatGPT 5.2. The responses were analyzed using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), Simple Measure of Gobbledygook (SMOG), Ensuring Quality Information for Patients (EQIP), Patient Education Materials Assessment Tool (PEMAT), Journal of the American Medical Association (JAMA) benchmark criteria, and a 5-point Likert scale.
Results: The average FKGL of the responses generated by ChatGPT was 11.5 ± 9.1, exceeding the recommended sixth-grade level. The average FRE was 43.8 ± 12.1, indicating a difficult reading level. In quality assessments, the EQIP and PEMAT scores were above the recommended level. The vast majority of responses (77.8–88.9%) were evaluated as error-free by specialist physicians.
Conclusion: This study highlights the potential and challenges of using ChatGPT for patient education regarding sacroiliac joint dysfunction and provides a framework for comprehensive quality assessment of health information generated by artificial intelligence.

Keywords

chatbots, ChatGPT, sacroiliac dysfunction

Introduction

Sacroiliac joint dysfunction is a commonly overlooked cause of chronic low back pain, affecting around 15-30% of patients with low back pain. The condition results from abnormal motion or malalignment of the sacroiliac joint, which can lead to significant disability and reduced quality of life. Despite its prevalence, sacroiliac joint dysfunction remains challenging to diagnose and manage, often requiring specialized knowledge and experience.^1,2 Given the complexity of this condition and the limited public awareness, patients increasingly turn to online resources, including AI chatbots, to understand their symptoms and treatment options.³
AI has revolutionized how patients access medical information, with ChatGPT increasingly used for health queries.⁴ Therefore, it is necessary to critically evaluate the quality and readability of AI-generated medical content to ensure patient safety, as the quality of patient education materials significantly impacts health outcomes, treatment adherence, and patient satisfaction. Research has consistently shown that most health information available online is written at a level exceeding the recommended sixth-grade level set by the American Medical Association and the National Institutes of Health. Furthermore, patient education materials must not only be readable but also understandable and actionable, enabling patients to make informed decisions about their care.^5,6,7
While numerous studies have evaluated AI chatbot performance across various medical specialties, a comprehensive assessment of ChatGPT responses regarding sacroiliac joint dysfunction remains limited.^8,9 This study aimed to systematically evaluate ChatGPT's responses to common questions about sacroiliac joint dysfunction using validated assessment tools and expert review, evaluating the quality, accuracy, and appropriateness of AI-generated patient education materials.

Materials and Methods

Although this research evaluates artificial intelligence-generated content, it does not involve predictive modeling or machine learning model development.
This cross-sectional study was conducted on January 9, 2026, with a systematic search of Google Trends to identify the most frequently searched questions about sacroiliac dysfunction. On this date, a retrospective analysis of search trends over the previous 20 years (January 9, 2006 – January 9, 2026) was conducted in the health category on Google Trends. Data generated after the extraction date were not included in the analysis. The top 50 queries related to sacroiliac dysfunction were initially identified. After excluding 12 irrelevant queries and 11 duplicate queries, a final dataset consisting of 27 unique and relevant queries was obtained for analysis.
Each of the 27 questions was submitted to ChatGPT 5.2, an advanced language model by OpenAI. To reduce bias, every question was asked in a separate chat session using default settings with no chat history or customization. We did not modify temperature, token limits, or other parameters. All responses were collected and stored for later analysis (Appendix 1).
ChatGPT responses were evaluated using validated tools to analyze quality, readability, and credibility. Three widely used readability formulas were applied to evaluate text complexity. The Flesch-Kincaid Grade Level (FKGL) estimates the US grade level required to comprehend the text; recommended healthcare materials target a grade level of 6 or below. The Flesch Reading Ease (FRE) score ranges from 0 to 100, with higher scores indicating easier readability; scores above 60 are considered acceptable for general healthcare information. The Simple Measure of Gobbledygook (SMOG) index was also employed, as it has been validated against 100% comprehension and is considered particularly suitable for healthcare applications.¹⁰ Word count was documented for each response to assess verbosity and information density.
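The three readability formulas described above can be written out directly from their standard published definitions. The sketch below is illustrative only: the study does not state which calculator was used, so the function names and the count-based inputs (words, sentences, syllables, polysyllabic words) are assumptions, and real tools differ in how they count syllables and sentences.

```python
import math


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_index(polysyllables: int, sentences: int) -> float:
    """SMOG index; polysyllables = words with three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


if __name__ == "__main__":
    # Hypothetical counts for one response (not data from the study).
    words, sentences, syllables, polysyllables = 100, 5, 150, 12
    print(round(flesch_reading_ease(words, sentences, syllables), 1))
    print(round(flesch_kincaid_grade(words, sentences, syllables), 1))
    print(round(smog_index(polysyllables, sentences), 1))
```

With these hypothetical counts, the text would score roughly 59.6 on FRE (just under the threshold of 60) and about grade 9.9 on FKGL, which shows how a passage can miss both targets at once.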
The Patient Education Materials Assessment Tool (PEMAT) for Printable Materials was used to assess understandability and actionability. The tool checks whether people with diverse backgrounds and literacy levels can process key messages (understandability) and identify the necessary actions (actionability). Scores are expressed as percentages; 70% or higher is generally considered acceptable.¹¹ The Ensuring Quality Information for Patients (EQIP) tool was employed to provide a comprehensive assessment of information quality. EQIP evaluates multiple dimensions, including content clarity, completeness, structure, layout, and identification data. The tool consists of 36 items divided into three main domains: content, structure, and identification. Each item is scored according to specific criteria, yielding both domain-specific and overall quality scores.¹²
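As a rough illustration of how a PEMAT percentage is obtained, the sketch below scores a list of item ratings, with items marked "not applicable" excluded from the denominator. The function names and the example ratings are hypothetical; only the 70% threshold comes from the text above.

```python
from typing import Optional, Sequence

PEMAT_THRESHOLD = 70.0  # percent; the acceptability threshold cited in the text


def pemat_score(ratings: Sequence[Optional[int]]) -> float:
    """Return the PEMAT percentage for one domain.

    ratings: 1 = agree, 0 = disagree, None = not applicable (excluded).
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("at least one applicable item is required")
    return 100.0 * sum(applicable) / len(applicable)


def meets_threshold(score: float) -> bool:
    """True when the score reaches the 70% threshold."""
    return score >= PEMAT_THRESHOLD


if __name__ == "__main__":
    # Hypothetical ratings: three of four applicable items rated "agree".
    example = [1, 1, 0, None, 1]
    score = pemat_score(example)
    print(score, meets_threshold(score))
```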
The Journal of the American Medical Association (JAMA) benchmark criteria were used to assess the transparency and reliability of information. JAMA benchmarks evaluate four core standards: authorship (authors and contributors with their credentials should be provided), attribution (references and sources for content should be listed clearly), disclosure (sponsorship, advertising, funding arrangements, and potential conflicts of interest should be disclosed), and currency (dates of content posting and updates should be indicated). Each criterion is scored as present or absent, with a maximum score of 4 points.¹³
Additionally, two independent expert physicians with specialized knowledge in musculoskeletal disorders evaluated the accuracy of each ChatGPT response using a 5-point Likert scale. The scale was defined as 1 (no inaccuracy), 2 (low level of inaccuracy), 3 (moderate level of inaccuracy), 4 (high level of inaccuracy), and 5 (very high level of inaccuracy). Experts assessed whether the information provided was medically accurate, up to date, and appropriate for patient education. Both experts were blinded to each other's ratings during the initial evaluation phase.
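The JAMA benchmark scoring described above is a simple count of criteria that are present. A minimal sketch, assuming each criterion is recorded as a boolean; the class name and the example response are hypothetical and do not reproduce the study's ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JamaBenchmark:
    """One response's JAMA benchmark assessment (present = True)."""
    authorship: bool
    attribution: bool
    disclosure: bool
    currency: bool

    def score(self) -> int:
        """Number of criteria met, from 0 to 4."""
        return sum((self.authorship, self.attribution,
                    self.disclosure, self.currency))


if __name__ == "__main__":
    # Hypothetical response that lists its sources but meets no other criterion.
    response = JamaBenchmark(authorship=False, attribution=True,
                             disclosure=False, currency=False)
    print(response.score())
```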
Ethical Approval: Ethics committee approval was not required for this study because no human participants, patient data, or animal subjects were involved.
Statistical Analysis: The Statistical Package for the Social Sciences (IBM Corp., Armonk, NY, USA), version 23, was used for data analysis. Inter-rater reliability between the two expert physicians was evaluated using the intraclass correlation coefficient (ICC) with a two-way mixed-effects absolute-agreement model. Descriptive statistics, including means, standard deviations, medians, and interquartile ranges (IQR), were calculated for all assessment measures. The distributions of PEMAT understandability and actionability scores, EQIP total scores, JAMA benchmark criteria, readability measures (FKGL, FRE, SMOG index, word count), and expert accuracy ratings were analyzed. Pearson correlation coefficients were computed to examine relationships between different quality metrics. Statistical significance was set at p < 0.05 for all analyses.
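To make the inter-rater reliability step more concrete, the sketch below computes an absolute-agreement ICC from a two-way ANOVA decomposition. It is not the study's SPSS procedure: it assumes the single-measures form (the text does not say whether single or average measures were reported), and the function name and example ratings are hypothetical.

```python
from typing import Sequence


def icc_absolute_agreement_single(ratings: Sequence[Sequence[float]]) -> float:
    """Single-measures, absolute-agreement ICC for an n-subjects x k-raters table.

    Undefined (division by zero) when every rating is identical.
    """
    n = len(ratings)        # subjects (here: responses)
    k = len(ratings[0])     # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )


if __name__ == "__main__":
    # Hypothetical Likert ratings: rater 2 is always one point higher.
    shifted = [[1, 2], [2, 3], [3, 4]]
    print(round(icc_absolute_agreement_single(shifted), 3))
```

In the hypothetical example the two raters rank the responses identically, yet the coefficient falls well below 1 because an absolute-agreement model penalizes the constant one-point offset between raters.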
Reporting Guidelines: This study is reported in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guidelines for cross-sectional studies.

Results

A total of 27 unique queries about sacroiliac joint dysfunction were analyzed after exclusion of 12 unrelated and 11 duplicate queries from the initial 50 most frequently searched questions on Google Trends. The queries were categorized as follows: condition/disease (n = 10, 37.0%), medication/treatment (n = 7, 25.9%), symptom (n = 6, 22.2%), miscellaneous (n = 2, 7.4%), and investigation (n = 2, 7.4%).
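The query counts and category percentages reported above can be checked with a few lines of arithmetic. The variable names below are illustrative; the numbers are those given in the text.

```python
# Query selection: 50 initial queries minus exclusions.
initial, irrelevant, duplicates = 50, 12, 11
included = initial - irrelevant - duplicates

# Category counts as reported in the Results.
category_counts = {
    "condition/disease": 10,
    "medication/treatment": 7,
    "symptom": 6,
    "miscellaneous": 2,
    "investigation": 2,
}

total = sum(category_counts.values())
shares = {name: round(100 * n / total, 1) for name, n in category_counts.items()}

if __name__ == "__main__":
    print(included, total)
    for name, pct in shares.items():
        print(f"{name}: {pct}%")
```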
Table 1 presents the readability metrics of ChatGPT-generated responses. The mean word count was 362.8 ± 152.2 (median: 382, IQR: 268-488, range: 19-566). The mean Flesch Reading Ease score was 43.8 ± 12.1 (median: 44.8, IQR: 41.4-48.0), indicating a difficult reading level. The mean Flesch-Kincaid Grade Level was 11.5 ± 9.1 (median: 9.1, IQR: 8.6-9.6), substantially exceeding the recommended sixth-grade level for patient education materials. The SMOG index averaged 10.7 ± 1.3 (median: 10.8, IQR: 10.2-11.3). All 27 responses (100%) exceeded the recommended sixth-grade reading level, and 26 responses (96.3%) had Flesch Reading Ease scores below 60, indicating difficult readability.
The quality assessment results are summarized in Table 2. PEMAT understandability scores were high, with a mean of 85.4 ± 5.0% (median: 85.0%, IQR: 83.0-89.0%), and all responses (100%) met the recommended threshold of 70%. PEMAT actionability scores averaged 77.6 ± 12.8% (median: 78.0%, IQR: 73.0-87.0%), with 85.2% of responses meeting the 70% threshold. The EQIP total score averaged 86.9 ± 8.1% (median: 91.7%, IQR: 85.4-91.7%), indicating high overall quality. JAMA benchmark scores were consistently low, with a mean of 0.96 ± 0.19 out of 4 (median: 1.00, IQR: 1.00-1.00), primarily due to the absence of explicit authorship attribution, disclosure statements, and currency dates in the AI-generated responses.
The distribution of expert accuracy ratings is presented in Supplementary Table 1. The intraclass correlation coefficient for the accuracy ratings between the two expert evaluators indicated excellent inter-rater reliability (ICC = 0.900).

Discussion

The findings of this study showed that while ChatGPT demonstrated excellent accuracy and high scores on quality metrics, including PEMAT understandability (85.4%), PEMAT actionability (77.6%), and EQIP (86.9%), all responses exceeded the recommended sixth-grade reading level, with a mean Flesch-Kincaid Grade Level of 11.5. The excellent inter-rater reliability (ICC = 0.900) among expert evaluators confirmed the robustness of the accuracy assessment, and the majority of responses (77.8-88.9%) were deemed free of medical inaccuracies. These findings suggest that ChatGPT can generate medically accurate and well-structured patient education materials; however, the content's complexity may limit accessibility for patients with lower health literacy.
The readability challenges identified in this study align with recent investigations of AI-generated health information across multiple medical domains. Zhou et al.⁹ evaluated ChatGPT and DeepSeek responses on spinal surgeries and similarly reported that AI-generated content consistently exceeded recommended reading levels, with mean FKGL scores ranging from 10.2 to 12.4.⁹ Scaff et al.³ found comparable results when assessing AI chatbot responses to low back pain questions, noting that, despite high accuracy, readability remained a significant barrier to patient comprehension.³ In the fibromyalgia domain, Zure and Menekşeoğlu⁷ reported a mean FKGL of 11.8 for ChatGPT responses, reinforcing the pattern that AI-generated health content tends toward academic language rather than patient-friendly communication.⁷ This consistency across musculoskeletal conditions suggests a systematic limitation in current large language models' ability to automatically adjust language complexity for lay audiences, despite their capacity to generate medically accurate information.
The high PEMAT scores observed in this study, particularly for understandability (85.4%) and actionability (77.6%), contrast with the poor traditional readability metrics. Haver et al.⁸ reported similarly high PEMAT scores for ChatGPT-generated breast cancer screening recommendations despite elevated reading levels, suggesting that AI-generated content maintains logical organization and clear presentation even with complex vocabulary.⁸ This discrepancy has been noted in other recent studies and may reflect fundamental differences in what these tools measure.^9,14
The medical accuracy of ChatGPT responses in this study was notably high, with 77.8-88.9% of responses rated by expert evaluators as containing no inaccuracies. Lee et al.⁴ reported that GPT-4 demonstrates substantial medical knowledge and reasoning capabilities, though they cautioned that occasional errors are possible and that expert oversight is important.⁴ This finding aligns with the broader literature on ChatGPT performance in healthcare contexts.^15,16
The excellent inter-rater reliability (ICC = 0.900) observed in the current study strengthens confidence in these accuracy assessments and aligns with reliability metrics reported in similar evaluation studies. However, two responses (7.4%) received ratings of 4-5 on the Likert scale from at least one evaluator, indicating high or very high levels of inaccuracy. These outliers underscore the variable performance of AI systems and the critical need for review by healthcare professionals before AI-generated patient education materials are disseminated, a recommendation echoed by Wang et al.⁶ in their ethical framework for ChatGPT use in healthcare.⁶
Consistently low JAMA benchmark scores (mean = 0.96 out of 4) reflect an inherent limitation of AI-generated content that lacks traditional publishing infrastructure. ChatGPT responses lack author credentials, citations, disclosure statements, or date stamps—elements considered essential for evaluating the credibility of online health information.^6,17 This limitation is not unique to the current study; similar deficiencies have been documented in multiple investigations of AI-generated health content. The absence of these transparency markers raises important questions about patient trust and the potential for misinformation, particularly when AI-generated content is shared without proper attribution or a disclaimer. Future AI systems may need to incorporate automatic citation generation and transparency disclosures to enhance credibility and enable fact-checking. Until such capabilities are developed, healthcare institutions and platforms disseminating AI-generated content must implement clear labeling and oversight protocols to ensure patients understand the source and limitations of the information they receive.^18,19

Limitations

This study has several limitations that need consideration. First, the evaluation was based on a single time point using ChatGPT version 5.2; ChatGPT and other AI models are continuously updated, which may alter response characteristics. Second, the study focused exclusively on sacroiliac joint dysfunction queries identified through Google Trends, which may not represent the full spectrum of patient questions or concerns about this condition. Third, although multiple validated assessment tools were employed, these instruments were originally designed for human-authored content and may not capture all relevant dimensions of AI-generated material quality. Fourth, the study did not assess patient perspectives or actual comprehension, relying instead on expert evaluation and standardized metrics.

Conclusion

Future research should include patient-centered assessments to determine whether the documented quality metrics translate into meaningful understanding and appropriate health behaviors. Additionally, comparative studies of AI models, prompt engineering strategies to improve readability, and longitudinal assessments of AI performance over time would provide valuable insights into optimizing AI-generated patient education materials. Despite these limitations, this study provides important evidence on the potential and challenges of using ChatGPT for patient education about sacroiliac joint dysfunction and establishes a framework for comprehensive quality assessment of AI-generated health information.

Declarations

Ethics Declarations

Ethics committee approval was not required for this study because no human participants, patient data, or animal subjects were involved.

Animal and Human Rights Statement

This study did not involve human participants or animal subjects.

Informed Consent

Informed consent was not required because the study analyzed publicly available search queries and artificial intelligence–generated responses and did not involve human participants or identifiable personal data.

Data Availability

The datasets used and/or analyzed during the current study are not publicly available due to patient privacy reasons but are available from the corresponding author on reasonable request.

Conflict of Interest

The authors declare that there is no conflict of interest.

Funding

None.

Author Contributions (CRediT Taxonomy)

Hatice Zehra Ferhatlar: Conceptualization, Methodology, Validation, Investigation, Data curation, Writing – original draft
Mert Zure: Software, Writing – review & editing

Scientific Responsibility Statement

The authors declare that they are responsible for the article’s scientific content, including study design, data collection, analysis and interpretation, writing, and some of the main line, or all of the preparation and scientific review of the contents, and approval of the final version of the article.

Abbreviations

AI: Artificial Intelligence
CI: Confidence Interval
EQIP: Ensuring Quality Information for Patients
FKGL: Flesch-Kincaid Grade Level
FRE: Flesch Reading Ease
IBM: International Business Machines
ICC: Intraclass Correlation Coefficient
IQR: Interquartile Range
JAMA: Journal of the American Medical Association
NY: New York
PEMAT: Patient Education Materials Assessment Tool
SMOG: Simple Measure of Gobbledygook
STROBE: Strengthening the Reporting of Observational Studies in Epidemiology
USA: United States of America

References

1. Gartenberg A, Nessim A, Cho W. Sacroiliac joint dysfunction: pathophysiology, diagnosis, and treatment. Eur Spine J. 2021;30(10):2936-2943. doi:10.1007/s00586-021-06927-9
2. Javadov A, Ketenci A, Aksoy C. The efficiency of manual therapy and sacroiliac and lumbar exercises in patients with sacroiliac joint dysfunction syndrome. Pain Physician. 2021;24(3):223-233. doi:10.36076/ppj.2021/24/223
3. Scaff SPS, Reis FJJ, Ferreira GE, et al. Assessing the performance of AI chatbots in answering patients' common questions about low back pain. Ann Rheum Dis. 2025;84(1):143-149. doi:10.1136/ard-2024-226202
4. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233-1239. doi:10.1056/nejmsr2214184
5. Eltorai AE, Sharma P, Wang J, Daniels AH. Most American Academy of Orthopaedic Surgeons' online patient education material exceeds average patient reading level. Clin Orthop Relat Res. 2015;473(4):1181-1186. doi:10.1007/s11999-014-4071-2
6. Wang C, Liu S, Yang H, et al. Ethical considerations of using ChatGPT in health care. J Med Internet Res. 2023;25(8):e48009. doi:10.2196/48009
7. Zure M, Menekşeoğlu AK. Assessment of the artificial intelligence-generated fibromyalgia information: beyond the hype. Arch Rheumatol. 2024;40(3):358-364.
8. Haver HL, Ambinder EB, Bahl M, et al. Appropriateness of breast cancer prevention and screening recommendations provided by ChatGPT. Radiology. 2023;307(4):e230424. doi:10.1148/radiol.230424
9. Zhou M, Pan Y, Zhang Y, et al. Evaluating AI-generated patient education materials for spinal surgeries: comparative analysis of readability and quality. Int J Med Inform. 2025;198:105871. doi:10.1016/j.ijmedinf.2025.105871
10. Boutemen L, Miller AN. Readability of publicly available mental health information: a systematic review. Patient Educ Couns. 2023;111:107682. doi:10.1016/j.pec.2023.107682
11. Shoemaker SJ, Wolf MS, Brach C. Development of the patient education materials assessment tool: a new measure of understandability and actionability for print and audiovisual patient information. Patient Educ Couns. 2014;96(3):395-403. doi:10.1016/j.pec.2014.05.027
12. Moult B, Franck LS, Brady H. Ensuring quality information for patients: development and preliminary validation of a new instrument to improve the quality of written health care information. Health Expect. 2004;7(2):165-175. doi:10.1111/j.1369-7625.2004.00273.x
13. Silberg WM, Lundberg GD, Musacchio RA. Assessing, controlling, and assuring the quality of medical information on the internet: let the reader and viewer beware. JAMA. 1997;277(15):1244-1245. doi:10.1001/jama.1997.03540390074039
14. Giray E, Korkmaz MD, Illeez OG, et al. Let’s chat about adolescent idiopathic scoliosis: accuracy and reliability of chat responses to frequently asked questions. BMC Musculoskelet Disord. 2025;26(1):1075. doi:10.1186/s12891-025-09315-2
15. Lechien JR, Carroll TL, Huston MN, Naunheim MR. ChatGPT-4 accuracy for patient education in laryngopharyngeal reflux. Eur Arch Otorhinolaryngol. 2024;281(5):2547-2552. doi:10.1007/s00405-024-08560-w
16. Shah YB, Ghosh A, Hochberg AR, et al. Comparison of ChatGPT and traditional patient education materials for men's health. Urol Pract. 2024;11(1):87-94. doi:10.1097/upj.0000000000000490
17. Alowais SA, Alghamdi SS, Alsuhebany N, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689. doi:10.1186/s12909-023-04698-z
18. Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne). 2024;11:1477898. doi:10.3389/fmed.2024.1477898
19. Esmaeilzadeh P. Challenges and strategies for wide-scale artificial intelligence deployment in healthcare practices: a perspective for healthcare organizations. Artif Intell Med. 2024;151:102861. doi:10.1016/j.artmed.2024.102861

Additional Information

Publisher’s Note
Bayrakol MP remains neutral with regard to jurisdictional and institutional claims.

Rights and Permissions

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). To view a copy of the license, visit https://creativecommons.org/licenses/by-nc/4.0/

About This Article

Received: February 8, 2026
Accepted: April 24, 2026
Published Online: April 29, 2026