Accuracy of ChatGPT3.5 in answering clinical questions on guidelines for severe acute pancreatitis
BMC Gastroenterology volume 24, Article number: 260 (2024)
Abstract
Background
Guidelines must be interpreted comprehensively and correctly to standardize the clinical process. However, this process is challenging and requires interpreters to have a medical background and qualifications. In this study, the accuracy of ChatGPT3.5 in answering clinical questions related to the 2019 guidelines for severe acute pancreatitis was evaluated.
Methods and results
An observational study was conducted using the 2019 guidelines for severe acute pancreatitis. The study compared the accuracy of ChatGPT3.5 in English versus Chinese and found higher accuracy in English (71%) than in Chinese (59%), although the difference was not statistically significant (P = 0.203). The study also compared the accuracy of ChatGPT3.5 on short-answer versus true/false questions and found higher accuracy on short-answer questions (76%) than on true/false questions (60%) (P = 0.405).
Conclusions
For clinicians managing severe acute pancreatitis, ChatGPT3.5 may have potential value. However, it should not be relied upon excessively for clinical decision making.
Background
Acute pancreatitis (AP) is an inflammation of the pancreas, typically caused by gallstones or excessive alcohol consumption. The global incidence of AP is 30–40 cases per 100,000 people per year, and it is more than twice as high in some regions [1, 2]. In 80–90% of cases, AP is a moderate disease (interstitial oedematous acute pancreatitis); in the remaining 10–20% of cases it is severe and life-threatening, with an in-hospital mortality rate of about 15% [3]. The severity of acute pancreatitis is classified as mild, moderate, or severe according to the 2012 Atlanta Classification, revised through international consensus, and severe acute pancreatitis is defined by organ failure persisting for more than 48 h.
Patients with severe acute pancreatitis (SAP) often experience major disruption to their lives, and during hospitalization they may feel a mix of hope, sorrow, and a desire to overcome the disease. Recovery from SAP can be a challenging journey, both physically and emotionally, highlighting the importance of healthcare. The internet is the resource most commonly used by patients seeking medical information. ChatGPT, a rapidly emerging large language model (LLM) based on artificial intelligence, has the potential to be an effective tool for patient education in healthcare [4]. As of June 2023, the ChatGPT website had received over 1.8 billion visits, indicating its immense popularity. Research has shown that GPT-3.5 performs at or near the passing score for the United States Medical Licensing Examination [5]. However, according to a recent study [6], participants with different levels of clinical experience held differing opinions on the use of ChatGPT for clinical practice and medical education: a larger proportion of attending staff and residents disagreed with its use, and although medical students were satisfied with responses that were consistent with textbooks and sounded authentic, more experienced physicians were able to identify the shortcomings of the responses. The purpose of this study was to assess the accuracy of ChatGPT3.5 in answering clinical questions about severe acute pancreatitis.
Methods
Source of the question
The survey was conducted in December 2023. We formulated 34 short-answer questions and 15 true/false questions as follows. The 2019 WSES Guidelines for the Management of Severe Acute Pancreatitis contain 27 clinical questions, of which questions 2, 13, 16, 17, and 22 were each divided into 2–4 sub-questions. We excluded questions 11, 23, 25, and 26 because they either lacked detailed answers or contained specialized terminology that ChatGPT3.5 could not understand. We ultimately agreed on 34 short-answer questions. To better assess ChatGPT3.5’s ability to interpret the guidelines, we also created 15 true/false questions based on the recommendations.
Accuracy assessment
We asked ChatGPT-3.5 each question in Chinese and English. To examine the reproducibility of ChatGPT’s responses, each question was entered into ChatGPT twice, and both responses were recorded. When a response did not directly address the question, the two answer options were offered for selection; when the format of the response did not match the question, a note was added requesting that ChatGPT choose one option. During the assessment phase, two senior critical care medicine specialists evaluated each response in both Chinese and English (Table 1). The reviewers were instructed to classify each response as either ‘correct’ or ‘incorrect’ based on its adherence to the guidelines and their own medical expertise. The accuracy of the responses was determined according to the following standards:
(1) A true/false question was scored as correct only if the response matched the guideline answer exactly; responses such as “may be”, “can be considered”, or “may be beneficial” were treated as affirmative, whereas responses such as “unavailable”, “difference of opinion”, or “I’m not a doctor” were scored as incorrect. (2) A numerical question was scored as correct only if the number provided matched the guideline exactly. (3) A written (short-answer) question was scored as correct if there was no discrepancy between ChatGPT’s answer and the guideline answer; even if not all elements were included, the response was considered correct as long as it contained no errors, while any response with clear errors was scored as incorrect.
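To make the true/false scoring rules concrete, the hypothetical Python sketch below encodes them as a simple function. This is purely illustrative: in the study, scoring was performed manually by two senior critical care specialists, and the keyword matching used here (the phrase lists and the `startswith` check) is an assumption for demonstration only, not part of the published method.

```python
# Illustrative sketch only: the study scored responses manually; this helper
# merely encodes the written rubric for true/false answers to make it concrete.

AFFIRMATIVE_HEDGES = ("may be", "can be considered", "may be beneficial")
NON_ANSWERS = ("unavailable", "difference of opinion", "i'm not a doctor")

def score_true_false(response: str, guideline_answer: bool) -> bool:
    """Return True if a true/false response counts as correct under the rubric."""
    text = response.lower()
    if any(phrase in text for phrase in NON_ANSWERS):
        return False  # evasive or non-committal answers are scored as incorrect
    if any(phrase in text for phrase in AFFIRMATIVE_HEDGES):
        answered_yes = True  # hedged wording is treated as an affirmative answer
    else:
        # Crude stand-in for the human judgement applied in the actual study.
        answered_yes = text.startswith(("yes", "true"))
    return answered_yes == guideline_answer

# Example: a hedged response to a statement that the guideline affirms.
print(score_true_false("Early enteral nutrition may be beneficial.", True))  # True
```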
Each reviewer independently evaluated reproducibility by comparing the two responses to each question. When the two reviewers disagreed, a third senior expert from the Department of Critical Care Medicine was consulted to resolve the discrepancy.
Statistical analysis
The rates of correct responses for Chinese and English queries were calculated separately and reported as percentages, and the accuracy of responses in the two languages was compared. Subgroup analyses compared the accuracy of short-answer and true/false questions in English and in Chinese. The proportions of correct responses between groups were compared using chi-square tests. All statistical analyses were conducted using SPSS for Windows (version 27.0), with the statistical significance threshold set at P < 0.05. To assess reproducibility, we compared the two responses to each question. Responses that matched exactly were deemed reproducible; when they did not match, they were compared by category: if both were classified as ‘correct’ or both as ‘incorrect’, they were still considered reproducible, whereas responses classified into different categories were considered significantly different. The proportion of questions with significantly different responses is reported as a percentage.
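As a rough consistency check, the language comparison can be reproduced with a standard chi-square test on a 2 × 2 contingency table. The sketch below uses Python rather than the SPSS workflow actually used in the study, and the counts (35 of 49 correct in English, 29 of 49 in Chinese) are inferred from the reported percentages, so they are assumptions rather than published raw data.

```python
# Minimal sketch (not the authors' SPSS analysis): chi-square test comparing the
# proportion of correct answers in English versus Chinese, with counts inferred
# from the reported percentages (71% and 59% of 49 questions, respectively).
from scipy.stats import chi2_contingency

table = [
    [35, 49 - 35],  # English: correct, incorrect
    [29, 49 - 29],  # Chinese: correct, incorrect
]

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# p ≈ 0.20, consistent with the reported P value of 0.203
```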
Results
We asked ChatGPT a total of 49 questions. Overall, accuracy was higher in English than in Chinese (71% vs. 59%), although the difference was not statistically significant (P = 0.203) (Fig. 1). In English, ChatGPT was more accurate on short-answer questions (76%) than on true/false questions (60%; P = 0.405) (Fig. 2). In Chinese, there was no significant difference in accuracy between true/false questions (60%) and short-answer questions (59%; P = 0.938) (Fig. 3).
To evaluate reproducibility, we calculated the percentage of questions whose two responses differed significantly out of the total number of questions; the results are detailed in Table 2.
Discussion
In this study, we assessed the quality of medical information provided by ChatGPT3.5 regarding severe acute pancreatitis by comparing its answers with the recommendations of the 2019 guidelines. We found that ChatGPT3.5 showed moderate to good reliability when queried in English, but only average accuracy when queried in Chinese. In English, short-answer questions were answered more accurately than true/false questions, whereas in Chinese this difference was not apparent.
Several studies have documented ChatGPT’s use in healthcare, particularly for answering medical questions, but the results are inconsistent. One study found that ChatGPT produces responses near the passing threshold for the United States Medical Licensing Examination (USMLE) [7]. In contrast, Suchman et al. used ChatGPT-3 and ChatGPT-4 to answer the 2021 and 2022 American College of Gastroenterology self-assessment tests; both models failed the multiple-choice, single-answer tests, and the authors therefore did not recommend their use in their current form for medical education in gastroenterology [8].

Although no previous study has evaluated the accuracy of ChatGPT’s responses to the severe acute pancreatitis guidelines, and specifically the difference in accuracy across question types, several studies have examined the consistency of ChatGPT’s responses with clinical guidelines. One study evaluated the quality of medical information provided by ChatGPT-4 on the five hepato-pancreatico-biliary (HPB) diseases with the highest global burden of disease. Recommendations from the UK National Institute for Health and Care Excellence (NICE) guidelines on gallstone disease, pancreatitis, cirrhosis/portal hypertension and pancreatic ductal adenocarcinoma, as well as the European Association for the Study of the Liver (EASL) guidelines, were restated as questions and input into ChatGPT; the responses were then documented and compared with the original guideline statements. There was 60% agreement between the guideline recommendations and ChatGPT’s answers, with subgroup analysis showing the highest scores for pancreatitis [9]. Kusunose et al. evaluated ChatGPT’s accuracy in answering clinical questions related to the Japanese Society of Hypertension guidelines and found an overall accuracy of 64.5%, with no statistical difference between Japanese and English queries [10]. In another study, the European Association of Urology (EAU) guidelines on prostate cancer 2023 [11] were connected to ChatGPT 4.0 using the Link Reader plugin (OpenAI, https://chat.openai.com). Ten real vignettes of prostate cancer patients, copied unedited from multidisciplinary team (MDT) meetings and containing at least the patient’s age, PSA, MRI staging, and prostate biopsy histopathology, were provided, and the model was instructed to recommend the best treatment option for each patient in accordance with the linked guideline. The treatments suggested by ChatGPT did not differ significantly from the decisions made at the actual MDT meetings [12]. In the present study, which used ChatGPT3.5 without linking it to the guidelines, the accuracy was comparable to that of the previous two studies.

Clinical decision-making in healthcare involves diagnosing, testing, and treating diseases, and it directly affects prognosis. As healthcare evolves and guidelines are updated, recommendations for the same disease may change; guidelines therefore need to be interpreted accurately to prevent clinician errors, especially when answering true/false questions. Our findings suggest that ChatGPT may not be entirely suitable for clinical settings, given its accuracy rate of only 60% on true/false questions. It is also important to note that ChatGPT’s responses to Chinese queries often included hedging words such as ‘may be’ and ‘can be considered’, which accounted for 20% of cases.
In our scoring, these hedged cases were categorized as affirmative (‘yes’), which may have inflated the accuracy rate. Similarly, English responses sometimes contained non-committal phrases such as “I don’t have specific information” and “opinions may vary”; unless such subjective statements are clearly flagged as opinions, they can affect the scored accuracy. However, the current data do not show a statistically significant difference in accuracy between true/false and short-answer questions, likely because of the small sample size, and a study with a larger sample size is needed to further investigate ChatGPT’s accuracy on true/false questions. In the future, ChatGPT should focus on improving its accuracy in judging whether statements are correct or incorrect to enhance its applicability in clinical practice. Alternatively, a plugin could be developed to link specific guidelines and ensure a more accurate interpretation of them.
Our study has some limitations. First, the responses presented were generated in December 2023, and conversation-based AI models are evolving rapidly, so later versions may give different responses. Second, the sample size is relatively small, which may limit the generalizability of our findings. Third, some questions in the guidelines require the evaluators to apply their own clinical knowledge, which could introduce subjectivity. Fourth, our study examined only the 2019 guidelines for the management of severe acute pancreatitis; further research is needed to determine how well ChatGPT answers questions related to other medical specialties and guidelines. Finally, carefully designed prompts such as ‘according to the most recent guideline’ or ‘according to the guideline from a specific year’ might yield a higher percentage of correct answers.
Conclusion
In summary, ChatGPT3.5 answered questions related to the 2019 guidelines for managing severe acute pancreatitis with moderate to good accuracy, although the accuracy varied with the language and the type of question. The findings suggest that ChatGPT3.5 has the potential to be a valuable tool for clinicians who need rapid access to essential information on the management of severe acute pancreatitis. However, its current accuracy is insufficient to support clinical judgement and decisions on disease management. Further improvements should therefore focus on performance across languages and question types.
Declarations
Data availability
Data is provided within the manuscript or supplementary information files.
References
Petrov MS, Yadav D. Global epidemiology and holistic prevention of pancreatitis. Nat Rev Gastroenterol Hepatol. 2019;16(3):175–84. https://doi.org/10.1038/s41575-018-0087-5.
Iannuzzi JP, King JA, Leong JH, et al. Global incidence of acute pancreatitis is increasing over time: a systematic review and meta-analysis. Gastroenterology. 2022;162(1):122–34. https://doi.org/10.1053/j.gastro.2021.09.043.
van Santvoort HC, Bakker OJ, Bollen TL, et al. A conservative and minimally invasive approach to necrotizing pancreatitis improves outcome. Gastroenterology. 2011;141(4):1254–63. https://doi.org/10.1053/j.gastro.2011.06.073.
Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). 2023;11(6):887. Published 2023 Mar 19. https://doi.org/10.3390/healthcare11060887
Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. https://doi.org/10.1371/journal.pdig.0000198. Published 2023 Feb 9.
Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for clinical practice and Medical Education: cross-sectional survey of medical students’ and Physicians’ perceptions. JMIR Med Educ. 2023;9:e50658. https://doi.org/10.2196/50658. Published 2023 Dec 22.
Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing examination? The implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. https://doi.org/10.2196/45312. Published 2023 Feb 8.
Suchman K, Garg S, Trindade AJ. Chat Generative Pretrained Transformer fails the Multiple-Choice American College of Gastroenterology Self-Assessment Test. Am J Gastroenterol. 2023;118(12):2280–2. https://doi.org/10.14309/ajg.0000000000002320.
Walker HL, Ghani S, Kuemmerli C, et al. Reliability of Medical Information provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J Med Internet Res. 2023;25:e47479. https://doi.org/10.2196/47479. Published 2023 Jun 30.
Kusunose K, Kashima S, Sata M. Evaluation of the Accuracy of ChatGPT in answering clinical questions on the Japanese Society of Hypertension guidelines. Circ J. 2023;87(7):1030–3. https://doi.org/10.1253/circj.CJ-23-0308.
Mottet N, et al. EAU – EANM – ESTRO – ESUR – ISUP – SIOG Guidelines on Prostate Cancer [Internet]. EAU Guidelines. Edn. presented at the EAU Annual Congress, Milan, 2023. https://d56bochluxqnz.cloudfront.net/documents/full-guideline/EAU-EANM-ESTRO-ESUR-ISUP-SIOGGuidelines-on-Prostate-Cancer-2023_2023-03-27-131655_pdvy.pdf. Accessed 2023.
Gabriel J, Gabriel A, Shafik L, Alanbuki A, Larner T. Artificial intelligence in the urology multidisciplinary team meeting: can ChatGPT suggest European Association of Urology guideline-recommended prostate cancer treatments? BJU Int. Published online November 27, 2023. https://doi.org/10.1111/bju.16240.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
J Q and YL Z conceptualized and designed the study, collected, analyzed, and interpreted the data, and wrote the manuscript. L L reviewed and critiqued the guideline-related issues and revised the manuscript. All authors approved the final version for submission and take personal responsibility for their contributions and for the accuracy and completeness of the work.
Ethics declarations
Human ethics and consent to participate declarations
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Qiu, J., Luo, L. & Zhou, Y. Accuracy of ChatGPT3.5 in answering clinical questions on guidelines for severe acute pancreatitis. BMC Gastroenterol 24, 260 (2024). https://doi.org/10.1186/s12876-024-03348-y