Abstract
Aim
We aimed to evaluate the quality and readability of ChatGPT’s answers to frequently asked questions about fibromyalgia (FM).
Material and Methods
The most frequently searched terms related to FM were identified using Google Trends and entered into the ChatGPT-4 model in order of their rankings. The responses were categorized using the Ensuring Quality Information for Patients (EQIP) tool, and their quality and readability were assessed with the EQIP score, the Flesch-Kincaid Grade Level (FKGL), and the Flesch-Kincaid Reading Ease (FKRE).
Results
According to Google Trends data, the search frequency for the term “FM” increased from 2004 to 2024, peaking at 100% in March 2020. ChatGPT’s responses showed notable shortcomings in both quality and readability: the mean EQIP, FKGL, and FKRE scores were 39.89, 13.29, and 21.41, respectively. Furthermore, statistical analysis across the four categories showed no significant differences in EQIP, FKGL, or FKRE scores.
Conclusion
Our study revealed significant deficiencies in the quality of ChatGPT’s responses regarding FM. There is a need for more understandable and reliable information to improve communication in healthcare.
INTRODUCTION
Fibromyalgia (FM) is a relatively common chronic pain syndrome within the general population, with a global prevalence rate of approximately 2-3%. In addition to chronic widespread pain, the FM clinical profile is characterized by fatigue, sleep disturbances, cognitive impairment, autonomic dysfunction, somatic symptoms, and psychiatric disorders, all occurring without any underlying serious medical conditions (1). The pathogenesis of FM is incompletely understood; however, it is thought to involve a combination of genetic predisposition, stressful environmental factors, inflammatory processes, and central mechanisms such as pain centralization (2). The diagnosis of FM is made primarily based on clinical findings. Although physical examination and laboratory tests may not be definitive for diagnosis, they are crucial in excluding other conditions. The diagnosis is ultimately established through a detailed medical history (1, 2). Patient education, physical exercise, pharmacological agents, and cognitive behavioral therapy are the four main components of the treatment (3).
The Internet has become an essential and irreplaceable resource for health-related information. It is now a significant source of information for patients with FM. The demand for information is heightened by the absence of specific diagnostic tools, evidence-based treatment recommendations for FM, and the ongoing debates surrounding the condition. Barriers to understanding specialized medical terminology and the chronic nature of FM further contribute to this increased demand (4). Research indicates that individuals living with FM frequently turn to the Internet to learn more about the disease and to support shared decision-making processes with their healthcare providers (4-6).
ChatGPT, developed by OpenAI (United States), is a highly advanced language model that leverages deep learning techniques to generate responses that closely mimic human language patterns. It is adept at understanding the subtleties and complexities of human communication, enabling it to produce contextually accurate and relevant responses across a wide range of topics (7).
In summary, FM exemplifies the challenges of patient education due to the lack of definitive diagnostic tests, the absence of universally accepted treatment guidelines, and widespread misconceptions about the condition. These factors often lead patients to seek information online, where the quality and accuracy of content vary greatly, and this may pose the risk of misinformation. FM was chosen as the focus disease for this study because it represents an especially relevant condition to evaluate the potential and limitations of AI-driven tools like ChatGPT in addressing global health information needs.
In this study, we aimed to evaluate the quality and readability of ChatGPT’s answers to frequently asked questions about FM.
MATERIAL AND METHODS
This study was conducted on August 24, 2024, at the PM&R clinics of Mersin City Training and Research Hospital and Adana City Training and Research Hospital. Because it relied solely on publicly available online information and involved no human or animal participants, it was exempt from ethics committee approval, and no informed consent was required (8, 9). The study adheres to the ethical principles of the Declaration of Helsinki.
In this study, English was used as the primary language for conducting the Google Trends analyses with the term “FM”, generating the ChatGPT responses, and evaluating those responses using the Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL). This choice was made because English is a widely used international language, particularly in the fields of health, medicine, and scientific communication. By utilizing English terms, the study aimed to capture search behaviors from a broader audience across various regions and countries.
Before the searches were made, all personal web browsing history was cleared to prevent personalized results from influencing the findings. The Google Trends tool was utilized to identify the top search terms related to FM. Global searches under the health category, spanning from 2004 to August 25, 2024, were selected to determine the top keywords associated with FM. The “most relevant” option was selected in the related queries section, while the “subregion” option was chosen in the geographical areas of interest section.
The top 25 search terms, encompassing a broad spectrum of topics in Google’s online searches, were recorded. Two terms, “ms” and “me”, identified during this process were excluded from the analysis, as they were deemed irrelevant to the main context of FM-related searches.
The term “ICD-10 FM” was included as it appeared among the most frequently searched keywords on Google Trends. Although ICD-10 codes are primarily utilized by healthcare professionals, patients may also encounter these codes in their medical reports and seek to understand their meaning through online searches. Including this keyword ensured a comprehensive analysis of search behaviors and helped minimize selection bias, accurately reflecting the diversity of queries related to FM.
The identified keywords were entered into the ChatGPT-4 model sequentially, according to their rankings in Google Trends, and are presented in Table 1 in the same order.
Before evaluating ChatGPT’s performance with the keywords, the web browsing history was cleared, and separate chat pages were opened for each keyword. This approach helped to prevent potential interactions. The responses generated by ChatGPT-4 were systematically catalogued to evaluate the quality of information, readability, and comprehensiveness.
The 23 keywords were divided into four categories according to the criteria specified in the “Ensuring Quality Information for Patients” (EQIP) tool: condition or illness; drug, medication, or product; treatment or management; and diagnosis, testing, or procedures (Table 1). The EQIP tool takes into account various parameters, including clarity of information, writing quality, accuracy, reliability, and comprehensiveness. Each of the 20 parameters in the EQIP scale was assigned a score of 1 for a “yes” response, 0.5 for a “partial” response, and 0 for a “no” or “not applicable (N/A)” response. The scores for all parameters were then summed, divided by the total number of parameters, and expressed as a percentage to yield the final EQIP score. According to EQIP scores, texts can be classified as well written and of high quality (76% to 100%), of good quality with minor problems (51% to 75%), having serious quality problems (26% to 50%), or having severe quality problems (0% to 25%) (9, 10). To minimize bias in the calculation of EQIP, the recorded responses were independently evaluated by two physical medicine and rehabilitation specialists in separate settings. In cases of discrepancy, the average values were taken.
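The scoring procedure can be sketched in a few lines of code; this is a minimal illustration of the arithmetic and thresholds just described, not the official EQIP instrument:

```python
def eqip_score(ratings: list[str]) -> float:
    """EQIP percentage from per-item ratings: 'yes' = 1, 'partial' = 0.5,
    'no' and 'n/a' = 0; the sum is averaged and expressed as a percentage."""
    points = {"yes": 1.0, "partial": 0.5, "no": 0.0, "n/a": 0.0}
    return 100.0 * sum(points[r] for r in ratings) / len(ratings)

def eqip_band(score: float) -> str:
    """Quality band for an EQIP percentage score."""
    if score > 75:
        return "well written and high quality"
    if score > 50:
        return "good quality with minor problems"
    if score > 25:
        return "serious quality problems"
    return "severe quality problems"
```

For example, a hypothetical response rated “yes” on 10 items, “partial” on 5, “no” on 3, and “N/A” on 2 scores 62.5%, falling in the “good quality with minor problems” band.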
The readability of the texts was analyzed using two key metrics: the FKGL and the FKRE. These parameters were employed to assess the complexity and accessibility of the text content.
The FKGL was determined through a series of specific calculations. Initially, the total number of words was divided by the total number of sentences, and this value was multiplied by 0.39. Subsequently, the total number of syllables was divided by the total number of words, with the resulting figure multiplied by 11.8. The results from these calculations were then added together, and finally, 15.59 was subtracted from the sum to yield the FKGL score. A lower grade level score signifies that the text is easier to understand, whereas a higher score indicates greater linguistic complexity and requires a more advanced level of comprehension.
The FKRE formula assesses a document’s readability from the average sentence length, which is multiplied by 1.015, and the average number of syllables per word, which is multiplied by 84.6. The sum of these two products is subtracted from 206.835 to yield the document’s reading ease score. Higher scores reflect text that is easier to read, whereas a score of 30 or below indicates content suited to a college-graduate reading level (11).
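Both metrics can be expressed compactly in code using the standard Flesch-Kincaid constants. The syllable counter below is a naive vowel-group heuristic introduced for illustration; published readability tools use more elaborate syllabification rules:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FKRE) computed from sentence, word, and syllable counts."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    asl = len(words) / len(sentences)   # average sentence length (words/sentence)
    asw = syllables / len(words)        # average syllables per word
    fkgl = 0.39 * asl + 11.8 * asw - 15.59
    fkre = 206.835 - 1.015 * asl - 84.6 * asw
    return fkgl, fkre
```

A short one-syllable-per-word sentence yields a negative FKGL and an FKRE above 100, illustrating why longer sentences and polysyllabic medical vocabulary drive FKGL up and FKRE down.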
Statistical Analysis
The statistical program used in the study was IBM SPSS Statistics for Windows, Version 22.0 (Armonk, NY, USA). The normality of continuous variables was assessed using the Kolmogorov-Smirnov test and histogram plots. Categorical variables were presented as numbers (n) and percentages (%). For continuous variables that followed a normal distribution, Independent Samples t-test was used for comparisons between two groups, while One-Way Analysis of Variance was employed for comparisons among four groups. For continuous variables that did not follow a normal distribution, comparisons among four groups were performed using the Kruskal-Wallis test. In cases where the Kruskal-Wallis test identified significant differences, pairwise comparisons were conducted using the Mann-Whitney U test. Groups with insufficient sample sizes (n=1) were excluded from the analysis. Bonferroni correction was not applied for pairwise comparisons, as only a single Mann-Whitney U test was conducted. Intrarater reliability for each researcher and interrater reliability between the researchers were analysed using Cohen’s kappa coefficient for categorical data. A significance level of p<0.05 was considered statistically significant for all analyses.
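As a sketch, the non-parametric group comparisons and the kappa statistic described above can be reproduced with scipy and the standard library. The group data below are randomly generated placeholders, not the study’s values, and the rater labels are hypothetical:

```python
from collections import Counter

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def cohen_kappa(r1: list[str], r2: list[str]) -> float:
    """Cohen's kappa for two raters' categorical labels."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2.get(k, 0) for k in c1) / n ** 2   # chance agreement
    return (po - pe) / (1 - pe)

# Illustrative (randomly generated) EQIP scores for two categories
rng = np.random.default_rng(0)
cat_a = rng.normal(40, 8, size=13)   # e.g. "condition or illness"
cat_b = rng.normal(38, 8, size=8)    # e.g. "drug, medication, or product"

h_stat, p_omnibus = kruskal(cat_a, cat_b)    # non-parametric omnibus test
u_stat, p_pair = mannwhitneyu(cat_a, cat_b)  # pairwise follow-up
```

Groups of size one are simply excluded before calling these tests, since a single observation admits no within-group variance.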
RESULTS
The graph illustrates the temporal variation in search frequency for the term “FM” according to Google Trends data from 2004 to 2024 (Figure 1). In January 2004, the relative search interest for “FM” stood at 56%, while by August 2024, this interest had risen to 73%. Notably, the graph features a pronounced spike in March 2020, when the relative search interest reached 100%.
As depicted in the graph, Norway, Puerto Rico, and the United Kingdom are the top three countries where the term “FM” has been most frequently searched. Geographically, the highest search interest is concentrated in North America, Western and Northern European countries, and Australia (Figure 2).
Table 1 presents the 23 keywords obtained from Google Trends along with their corresponding categories as determined by the EQIP analysis.
The minimum, maximum, mean, and standard deviation values for EQIP, FKGL, and FKRE, as well as the percentage distribution of topic categories according to EQIP, are presented in Table 2. Based on the EQIP analysis, the most frequent categories were “condition or illness” (57%) and “drug, medication, or product” (35%) (Figure 3).
No statistically significant differences were found between Group 1 and Group 2 in terms of EQIP, FKGL, and FKRE values. In Groups 3 and 4, the standard deviation could not be calculated because each group contained only one response. Additionally, the statistical analysis across the four groups revealed no significant differences in EQIP, FKGL, or FKRE values (Table 3). The kappa value of 0.79 represents a high level of interrater reliability (p<0.001). The intrarater reliability analysis conducted after 109 days yielded kappa values of 0.87 for researcher 1 (Alper Uysal) and 0.85 for researcher 2 (Ertürk Güntürk), both statistically significant (p<0.001).
DISCUSSION
Our study revealed significant issues in the quality of responses provided by ChatGPT. The FKGL corresponded to a 13-year education level, while the FKRE score indicated that the texts were at a difficult readability level.
FM significantly impacts individuals’ lives, often manifesting as chronic widespread pain, sleep disturbances or non-restorative sleep, physical exhaustion, and cognitive difficulties. This condition involves a range of somatic and psychological symptoms and is frequently accompanied by co-morbid illnesses such as functional somatic syndromes (for instance, irritable bowel syndrome), anxiety and depressive disorders, and rheumatic diseases. These symptoms and associated conditions can profoundly complicate daily living, substantially diminishing quality of life (2).
When analyzing Google Trends data for the term “FM” from 2004 to 2024, a significant spike is observed in March 2020, with the relative search interest peaking at 100%. On 11 March 2020, the World Health Organization declared COVID-19 a global pandemic (12), and by 13 March 2020, Europe was reported as the new epicenter of the crisis (13). The spike in searches for FM during this period may be linked to the heightened anxiety, depression, and stress caused by the pandemic, which could have exacerbated FM symptoms. Notably, the surge in FM searches coincided with the rapid increase in COVID-19 cases, suggesting a possible correlation between the pandemic’s impact on mental health and the worsening of FM symptoms. When evaluating this relationship, it is also important to consider Long COVID, which presents symptoms that closely resemble those of FM (14, 15).
Despite such a complicated clinical picture, no cure currently exists for FM, so treatment is centered on improving the patient’s ability to function while managing pain and other associated symptoms (16). Moreover, given the complicated aspects of FM, numerous myths and misconceptions abound. All of this leads patients to actively seek reliable information (17). Patients may wish to gain a deeper understanding of their condition and the available treatment options to actively engage in the decision-making process. An informed patient is better equipped to contribute meaningfully to decisions about their care and may experience reduced anxiety as a result. Conversely, without access to quality information, patients may struggle to discuss their findings coherently with their doctors and may be unable to make well-informed decisions (18, 19). It has been reported that 72% of internet users seek medical information online (20). However, it is argued that the quality of web information is often poor and that it is presented in a way that increases the likelihood of misunderstanding (4, 21). Basavakumar et al. (21) have shown that online resources concerning FM, including its etiology, symptoms, comorbidities, and management, are frequently incomplete, with content that may be difficult to access and susceptible to misinterpretation. Ozsoy-Unubol and Alanbay-Yagci (19) investigated the quality of online information on YouTube concerning FM. They found that, despite its variability, the content on the YouTube platform was generally of poor quality. Moreover, they suggested that this poor-quality information could mislead patients and potentially harm the doctor-patient relationship.
Previous research has highlighted that online resources for FM, including those on platforms like YouTube, are often incomplete or of poor quality (19, 21). Such deficiencies can mislead patients and hinder effective communication with healthcare providers. Although ChatGPT provides information that is quickly and easily accessible, our findings suggest that the quality and clarity of its responses are insufficient to fully meet patient needs. For example, EQIP scores for ChatGPT responses were lower than the threshold for high-quality information.
The integration of ChatGPT into our daily experiences has led to its widespread adoption across various fields, including medicine. Beyond its use in medical education, research, and clinical practice, ChatGPT has also been employed in patient education and information (22).
In this context, studies have been conducted on various diseases, including osteoarthritis, cerebral palsy (CP), and osteoporosis (8, 9, 23). Ata et al. (8) assessed the reliability and utility of ChatGPT’s responses concerning cerebral palsy and determined that it serves as a reliable, though partially useful, source of information. They emphasized the importance of patients and their families verifying the medical information obtained by consulting their healthcare providers. Erden et al. (9) assessed the quality, readability, and comprehensibility of the information provided by ChatGPT concerning osteoporosis. Their findings indicated that the quality and readability of ChatGPT’s information were insufficient for effective health management. Yang et al. (23) determined that ChatGPT’s responses do not always fully align with recommendations from evidence-based clinical practice guidelines, and they suggested that both healthcare professionals and patients should approach the guidance offered by AI platforms with measured expectations, recognizing its current limitations in providing clinically sound advice. Consistent with existing literature on other diseases (8, 9, 23), our study revealed that the quality of responses provided by ChatGPT was insufficient. Additionally, we determined that the understandability of the content was categorized as difficult. We believe that the optimal approach to information transfer is to convey highly reliable information in a manner that is comprehensible to all segments of society.
In diseases like FM, where the etiopathogenesis, diagnosis, and treatment are complex, excessive reliance on medical information obtained through artificial intelligence and the internet could have adverse effects on patient health. While these tools offer quick and easy access to information, the principle that treatment should be tailored to the individual rather than the disease itself implies that general information may be insufficient or even inaccurate in complex cases (9, 24).
This study aimed to evaluate ChatGPT’s ability to generate globally relevant health-related responses by focusing on English-language keywords and international search trends. While regional trends are valuable, the study prioritized global applicability, leaving localized analyses for future research.
Future studies should explore keywords in other languages to understand how language and cultural differences influence the quality and accessibility of AI-generated healthcare content. This could help create more inclusive AI tools, addressing disparities in reliable health information for non-English-speaking populations.
The main limitations of the study are the exclusive focus on English terms, which could constrain the findings, and the selection of only 23 keywords. Expanding the keyword lists and incorporating other languages in future studies could yield more comprehensive results.
CONCLUSION
Our study revealed significant deficiencies in the quality and clarity of FM-related responses generated by ChatGPT, and these deficiencies raise concerns about its suitability for effective health communication. Healthcare professionals should guide patients in interpreting AI-generated information and verifying it with evidence-based sources. AI developers must enhance readability and alignment with clinical guidelines to improve reliability. Patients should critically evaluate online health content and consult healthcare providers for accurate advice. Refining algorithms is crucial to advancing the accuracy and accessibility of AI-generated responses for more effective health communication. Future research should examine non-English keywords to enhance AI tools’ inclusivity and address health information disparities across different languages and cultures.