Validation of a survey methodology for gastroesophageal reflux disease in China

Background Gastroesophageal reflux disease (GERD) causes a wide range of clinical symptoms and potentially serious complications, but epidemiological data about GERD in China are limited. The aim of this pilot study was to develop and validate a methodology for the epidemiological study of GERD in China. Methods Regionally stratified, randomized samples of Shanghai residents (n = 919) completed Mandarin translations of the Reflux Disease Questionnaire (RDQ), GERD Impact Scale, Quality of Life in Reflux and Dyspepsia (QOLRAD) questionnaire and 36-item Short Form Health Survey (SF-36). Reliability and construct validity were tested by appropriate statistical analyses. Results The response rate was 86%. The test-retest reliability coefficients for the RDQ, GERD Impact Scale, QOLRAD and SF-36 were 0.80, 0.71, 0.93 and 0.96, respectively, and Cronbach's alpha coefficients were 0.86, 0.80, 0.98 and 0.90, respectively. Dimension scores were highly correlated with the total scores for the QOLRAD and SF-36, and factor analysis showed credible construct validity for the RDQ, GERD Impact Scale and SF-36. The RDQ GERD score was significantly negatively correlated with QOLRAD dimensions of food and drink problems and social functioning, and was significantly negatively correlated with all dimensions of the SF-36. All eight of the SF-36 dimensions were significantly correlated with the QOLRAD total score. Conclusion This study developed and tested a successful survey methodology for the investigation of GERD in China. The questionnaires used demonstrated credible reliability and construct validity, supporting their use in larger epidemiological surveys of GERD in China.


Background
Gastroesophageal reflux disease (GERD) is a common disorder caused by backflow of stomach contents into the esophagus. As it can cause a wide range of clinical symptoms and potentially serious complications, the epidemiology of GERD has been a subject of much interest in recent years. GERD is frequently diagnosed on the basis of symptoms alone, with the criterion for diagnosis in clinical practice being when reflux symptoms become troublesome to the patient [1]. However, for epidemiological studies, a simple symptom threshold is required to identify those who have GERD. In many studies, this threshold is defined as at least weekly reflux symptoms [2]. GERD is common in the West, with a prevalence of about 10-20%, but the prevalence in Asia is generally lower at approximately 5% [2]. The prevalence of GERD is, however, thought to be increasing [3], with trends in Asia attracting particular interest [4]. There have been few high quality, population-based epidemiological surveys of GERD in Asia, particularly in China [5]. A number of methodological challenges associated with studying the epidemiology of GERD in this region may have contributed to this paucity.
To identify reflux symptoms accurately, validated patient-completed questionnaires are needed, as clinicians tend to underestimate the presence and severity of reflux symptoms reported by patients [6]. In particular, validated symptom descriptors (e.g. 'burning behind the breastbone') are necessary because terms such as 'heartburn' are known to be poorly understood by patients [7]; this is of particular relevance to Chinese populations, because there is no word for 'heartburn' in Mandarin Chinese beyond specialist medical circles, and a survey in the USA revealed that only 13.2% of East Asian patients understood the term [7].
Within the Chinese population, language and cultural differences can lead to different communities perceiving and expressing their symptoms differently. In China, Mandarin is the official language, but about half the population does not speak it, particularly those living in rural areas and older people [8]. There are thousands of local dialects, many of which are mutually unintelligible when spoken. All use the same writing system, and overall literacy rates in China are high, but literacy among older people, women and those living in rural areas is relatively low; in the 2003 census, over 9.6% of women and 2.1% of men were illiterate or semi-literate [9].
Population surveys can be difficult to implement in China. Telephone surveys may introduce population bias in favour of the more wealthy urban Chinese population who are more likely to have telephones. The utility of postal surveys is limited by the ability of the respondent to understand the terms used [10] which, for questionnaires developed in the West, may be further compounded by cultural conceptual differences. Response rates to telephone or postal questionnaires may be low, potentially introducing responder bias [11,12]. For these reasons, previous population surveys of GERD in China have administered questionnaires using a face-to-face interview technique, in which subjects completed the questionnaire while being assisted by trained interviewers [10,13,14]. This technique has achieved high response rates and has enabled terms and definitions to be clarified appropriately for individual respondents.
In order to investigate the prevalence and impact of GERD in China and facilitate comparisons with other countries, linguistic and psychometric validation of internationally recognized disease-specific and generic patient-reported outcomes instruments is required. The aim of this pilot study was to develop and validate a methodology for the epidemiological study of GERD in China. The feasibility, validity and reliability of several well-designed questionnaires were tested in a Chinese environment using randomized, stratified, multi-stage cluster sampling, a statistical sampling technique adopted by the World Health Organization (WHO) [15] that is particularly well suited to the residential and social administration system in China.

Setting
Shanghai, on the east coast of China, is China's largest city. It is divided into 18 districts and one county, each of which is classified as urban, suburban, or rural (Figure 1). Each district includes numerous blocks, which include multiple residential areas, and the county covers several towns that govern a number of villages. Broadly speaking, people who live in an urban area have a city lifestyle, while people who live in a rural region lead a farming or country peasant way of life. The suburban lifestyle is intermediate between these two.
Figure 1. The survey sites in Shanghai.

Sampling
A randomized, stratified, multi-stage cluster sampling methodology was used to select a representative sample of the general population in Shanghai. Huangpu was randomly selected from the nine urban districts, Pudong from the four suburban districts, and Songjiang from the five rural districts and one county of Shanghai. Blocks were randomly selected from districts and residential areas from blocks so that, finally, four residential areas in the urban district, three in the suburban district and two in the rural district were randomly selected (see Figures 1 and 2). The Residential Committee of each residential area supplied detailed household rosters of all adults, and subjects for this study were randomly sampled from these lists.
Pudong District consists of 26 towns and blocks, and is the biggest district in Shanghai. The residents in this district are widely dispersed and not all the information for each resident could be obtained. As information for all families in Pudong was available, families were randomly sampled from the selected residential areas and the family member with a birthday closest to the investigation date was selected.
According to the statistical formula $n = t^{2}pq/d^{2}$ (where n, t, p, q and d are sample size, t value, positive rate, negative rate and acceptable error, respectively), assuming a GERD prevalence of 10%, and setting significance at P = 0.05 and acceptable error at 2%, the calculated sample size was 864 [16]. According to the 1 in 10 000 sampling proportion principle and the population size of Shanghai, the target sample size was 1300 respondents. Combining these two figures, a target sample size of 1000 valid respondents was deemed appropriate. Allowing for a 20% non-response rate, the final intended sample size was set at 1200, including 400 subjects from each district.
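The sample size arithmetic above can be reproduced with a short sketch. It assumes t = 1.96, the normal critical value usually paired with a two-sided significance level of 0.05 (the text does not state the value used), and the function and variable names are illustrative only.

```python
def required_sample_size(p, d, t=1.96):
    """Sample size n = t^2 * p * q / d^2 for estimating a proportion.

    p: assumed positive rate (prevalence); d: acceptable absolute error.
    t: critical value; 1.96 is an assumption, not stated in the text.
    """
    q = 1.0 - p  # negative rate
    return (t ** 2) * p * q / (d ** 2)


n = required_sample_size(p=0.10, d=0.02)
print(round(n))  # 864, matching the figure quoted in the text

# Inflate the 1000-respondent target by the 20% non-response allowance.
target_valid = 1000
invited = round(target_valid * (1 + 0.20))
print(invited)  # 1200, i.e. 400 per district
```

With these inputs the formula gives roughly 864.4, which rounds to the 864 reported above.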
Residents under 18 years of age, or residents who were illiterate, had severe visual, hearing or learning disabilities, or major psychiatric illness, were excluded from the survey. Respondents who were not at home after three attempts to administer the questionnaire were considered to be missing.

Administration of questionnaires
Local residential committee staff informed residents of the survey and secured their support and understanding. The informed consent of respondents was obtained, and each respondent was free to discontinue participation in the study at any time. The study was approved by the Second Military Medical University Ethics Committee.
During the fieldwork period from November 2005 to January 2006, respondents completed questionnaires in their own homes or in local residential committee offices. Questionnaires were self-administered, with trained and supervised facilitators on hand to explain any questions that were unclear. The facilitators were social workers at the site, who were trained by supervisors who were professionals and graduate students from the Department of Health Statistics (DoHS), who received training from an epidemiology survey expert from the DoHS and a gastrointestinal specialist from Shanghai Hospital. Quality auditing was performed to ensure all questionnaires were completed properly. A valid questionnaire was one that had been audited and signed by a supervisor.

Questionnaires
Each respondent completed five questionnaires in Mandarin (see additional file 1: GERD questionnaire in English and Mandarin Chinese): a general information questionnaire and translations of four concise, well-validated, internationally recognized and frequently cited disease-specific and generic health questionnaires, chosen to facilitate comparison with other studies and minimize the length of the overall survey: the Reflux Disease Questionnaire (RDQ), the GERD Impact Scale, the Quality of Life in Reflux and Dyspepsia (QOLRAD) questionnaire, and the 36-Item Short-Form Health Survey (SF-36). The general information questionnaire collected information on
Figure 2. Stratified, multi-stage randomized cluster sampling of urban, suburban and rural districts in Shanghai.
The RDQ is a 12-item self-report questionnaire measuring the frequency and severity of upper gastrointestinal symptoms (heartburn, regurgitation and epigastric pain) over the previous week. Symptom frequency and severity are scored on a 6-point Likert scale (0-5, where 5 is the most severe/frequent). A GERD dimension score can be obtained by combining the heartburn and regurgitation scores [17]. Subjects reporting heartburn and/or regurgitation of any frequency during the 1-week recall period of the questionnaire were defined as having GERD. The RDQ was validated for use in clinical trials in two large studies [18,19], and was also recently validated for use as a diagnostic tool in the DIAMOND study (Diagnostic Tool for the Management of Patients with Reflux Disease) [20]. A Chinese version of the RDQ was tested in 10 hospitals in mainland China, and was found to identify accurately the presence of symptoms suggestive of GERD experienced over the previous month [21].
The GERD Impact Scale questionnaire is an eight-item self-report questionnaire designed to aid patient-physician communication in primary care. It assesses the frequency of gastroesophageal reflux symptoms over the past 2 weeks and their impact on everyday activities such as sleep, work, meals and social occasions, and the use of additional medication (other than that prescribed). Four response options for frequency are provided (1-4) where 1 is 'all of the time' and 4 is 'none of the time'. This newly developed tool has demonstrated good psychometric properties [22].
The GERD-specific version of the QOLRAD questionnaire is a 25-item disease-specific quality-of-life instrument measuring the impact of upper gastrointestinal symptoms over the previous week on five dimensions: emotional well-being, sleep, vitality, eating/drinking, and physical/social functioning [23]. The frequencies of effects are reported using a 7-point Likert scale, with low scores indicating frequent impairment. Its reliability and validity have been extensively documented in studies of patients with upper gastrointestinal symptoms [23-25].
The SF-36 is a generic questionnaire assessing health status and well-being over the past 4 weeks. It contains 36 items clustered in eight dimensions: physical functioning, role-physical, bodily pain, general health, vitality, social functioning, role-emotional, and mental health, plus one item assessing change in health status over the previous year [26]. Item scores for each dimension are coded, summed and transformed to a scale from 0 (worst possible health state) to 100 (best possible health state). Its reliability and validity are widely documented across a range of language versions [27,28].
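The 0-100 transformation can be sketched as a linear rescaling of the summed raw score. The function name and the example range (ten items each coded 1-3, giving raw scores from 10 to 30) are illustrative assumptions; the item recoding rules come from the SF-36 scoring documentation and are not reproduced here.

```python
def transform_dimension(raw_score, lowest_raw, highest_raw):
    """Rescale a summed raw dimension score to 0 (worst) .. 100 (best).

    lowest_raw and highest_raw are the minimum and maximum raw sums
    that the dimension can take after item recoding.
    """
    return (raw_score - lowest_raw) / (highest_raw - lowest_raw) * 100.0


# Hypothetical dimension: ten items coded 1-3, so raw sums span 10..30.
print(transform_dimension(25, 10, 30))  # 75.0
```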

Translation and cognitive debriefing
Apart from the SF-36, where validated Mandarin translations already exist [29], questionnaires were translated and tested in the Department of Medicine, Faculty of Medicine, at the University of Hong Kong. Literal translation of Hong Kong Chinese into mainland Chinese (Mandarin) was undertaken by investigators and a panel of mainland gastroenterologists so that questionnaires were more interpretable by people from mainland China. This process was followed by cognitive debriefing, where five literate volunteers from mainland China who had a diagnosis of GERD (heartburn and/or acid regurgitation over the past year) completed the translated questionnaires and were interviewed to assess their understanding and interpretation. The overall relevance and clarity of the questionnaire were assessed using defined responses (very low; low; moderate; high; very high) and subjects were asked to specify any items that they regarded as irrelevant or unclear. Subjects considered the questions to be relevant and clear (grading: moderate to very high). No additional revisions were required.

Statistical analysis

Data management
Questionnaire responses were coded and double-entered by two independent professional data-entry staff from the DoHS. EpiData software [30] was used to check for consistency between the two sets of data entries to ensure data quality. For the RDQ, QOLRAD, and SF-36, where at least 50% of items in a dimension were completed, the mean value of the completed items was used to impute the missing values. Where more than 50% of items were missing, the dimension score was excluded from the analysis [31-33]. For the GERD Impact Scale, if an item score was missing, imputation was not performed and the score was excluded from the analysis.
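The half-scale imputation rule described above can be illustrated with a minimal sketch. Missing items are represented here as None, and the function name is hypothetical; this is not the study's actual data-processing code.

```python
def impute_dimension(items):
    """Apply the 50% rule to one respondent's items for one dimension.

    items: list of numeric item scores, with None for a missing item.
    Returns the items with missing values replaced by the mean of the
    completed items, or None if more than 50% of items are missing
    (the dimension score is then excluded from analysis).
    """
    answered = [x for x in items if x is not None]
    if 2 * len(answered) < len(items):  # fewer than 50% completed
        return None
    mean = sum(answered) / len(answered)
    return [mean if x is None else x for x in items]


print(impute_dimension([3, None, 5, None]))     # [3, 4.0, 5, 4.0]
print(impute_dimension([3, None, None, None]))  # None
```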

Reliability
Internal consistency was evaluated using Cronbach's alpha coefficient to determine the extent to which items within each questionnaire were interrelated [34]. Cronbach's alpha coefficients for each questionnaire were calculated by correlating all individual item scores with dimension scores and/or the overall score. An alpha coefficient above 0.70 suggests good internal consistency and reliability.
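For readers unfamiliar with the coefficient, the sketch below computes Cronbach's alpha with the standard variance-based formula, alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). This is a generic illustration that may differ in detail from the statistical software routine used in the study; the data and names are made up.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents' item-score rows.

    rows: one list per respondent, each holding k item scores.
    Uses sample variances (n - 1 denominator) throughout.
    """
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up responses: four respondents, three perfectly consistent items.
responses = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
alpha = cronbach_alpha(responses)
print(alpha >= 0.70)  # True: meets the threshold used in the text
```

Perfectly consistent items, as in this toy data set, give the maximum value of 1; real questionnaire data fall below that.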
Test-retest reliability is a measure of the stability of the instrument under different conditions with the same respondent; in this study, it was assessed by retesting 10% of respondents (n = 40 from each region) 2-7 days after the baseline test. Cohen's kappa coefficient and the intraclass correlation coefficient (ICC) were used to analyze the test-retest reliability of the survey instruments. Cohen's kappa coefficient was used in the analysis of categorical and ranked measurements, while ICC was used to analyze quantitative measurements. A test-retest coefficient above 0.70 was considered acceptable [35].
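As an illustration of the agreement statistic used for categorical measurements, the following sketch computes unweighted Cohen's kappa, (observed agreement − chance agreement) / (1 − chance agreement), for paired test and retest answers. The data are invented, and whether the study used a weighted or unweighted variant is not stated in the text.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two paired lists of categories.

    Undefined (division by zero) when chance agreement equals 1,
    i.e. when every answer falls in a single category.
    """
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    count_first = Counter(first)
    count_second = Counter(second)
    # Chance agreement from the marginal category frequencies.
    expected = sum(count_first[c] * count_second[c] for c in count_first) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented baseline and retest answers for four respondents.
print(cohens_kappa([0, 1, 1, 0], [0, 1, 1, 0]))  # 1.0 (perfect agreement)
print(cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0 (chance-level agreement)
```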

Construct validity
Construct validity evaluates whether an instrument actually measures the phenomena that it theoretically predicts; correlation and factor analysis were used to evaluate construct validity in this study. Factor analysis using principal component analysis and quartimax rotation explored whether the factor structure of each questionnaire was supported. Factor loadings larger than 0.50 within one dimension were considered to support the factor construct provided the factor loadings were low across the other dimensions, with cumulative rates used to show the contributions of combinations of principal components [36]. Correlation analysis tested the construct validity of questionnaires containing multiple dimensions (i.e. RDQ, QOLRAD and SF-36). The analysis measured the strength of association between dimension scores and the total score for QOLRAD and SF-36 questionnaires, and between item scores and dimension scores for the RDQ. A strong correlation coefficient was considered to be over 0.6, a moderate correlation, 0.3-0.6, and a weak correlation below 0.3 [37].
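The correlation strength bands quoted above can be expressed as a small helper. Applying the bands to the absolute value of the coefficient, and assigning the boundary values 0.3 and 0.6 to the moderate band, are assumptions made for this sketch.

```python
def correlation_strength(r):
    """Classify a correlation coefficient using the bands in the text:
    over 0.6 strong, 0.3-0.6 moderate, below 0.3 weak.
    The sign is ignored (assumption), so negative correlations are
    classified by magnitude.
    """
    magnitude = abs(r)
    if magnitude > 0.6:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"


for r in (0.77, 0.53, -0.16):
    print(r, correlation_strength(r))
```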
Convergent validity analyzes whether the postulated dimension of an instrument correlates appreciably with all other dimensions from other instruments that should theoretically be related to it. Convergent validity was investigated in this study by correlating the GERD dimension from the RDQ with SF-36 and QOLRAD dimensions, and SF-36 dimensions with QOLRAD total score. A decrease in health-related quality of life was expected for respondents with GERD symptoms.

Response rate
Of the 1200 randomly pre-selected subjects, 1034 agreed to be interviewed (a response rate of 86%). In the Pudong District, a total of 112 respondents' questionnaires were withdrawn from the statistical analysis due to one facilitator's failure to adhere to the study protocol. A further three questionnaires from the Huangpu District were excluded due to incompleteness. Therefore, a total of 919 questionnaires (359 from the urban region, 224 from the suburban region, and 336 from the rural region) were included in the analysis after quality auditing. The mean response rates for items in each questionnaire are provided in Table 1.
Of 120 subjects randomly selected for retest, 113 agreed to be re-interviewed (a 94% response rate). Fourteen questionnaires were rejected because they were not completed in line with the study protocol, leaving 99 questionnaires for inclusion in the retest analysis.

Respondents
The respondents' average age was 47 years (ranging from 18 to 77 years); 55% were female and the majority of respondents (85%) were married. Most respondents did not smoke (74%) or drink alcohol (83%). The average BMI was 22.6 kg/m², with a range of 14.4-36.5 kg/m². Level and years of education, current job type and income level all varied significantly between the three regions (p < 0.0001). Education levels and family income were greatest for the urban region and lowest for the rural region (Table 2), reflecting the socioeconomic divide that exists between urban and rural China. Forty percent of urban respondents were professionals or technicians, while 73% of rural respondents and 44% of suburban respondents were agricultural or fishery workers.
Reliability coefficients for each questionnaire are shown in Table 3. All coefficients were ≥ 0.7, demonstrating good reliability and internal consistency for each questionnaire.

Construct validity
Each dimension score was highly correlated with the total score for both QOLRAD and SF-36 (p < 0.001), indicating good construct validity. For QOLRAD, Spearman correlation coefficients ranged from 0.77 for physical/social functioning to 0.91 for food and drink problems and for vitality, among respondents reporting symptoms of heartburn and/or regurgitation via the RDQ. For SF-36, Spearman correlation coefficients ranged from 0.53 for social functioning to 0.77 for general health, for the study population as a whole. The RDQ also demonstrated good construct validity (Table 4), with each dimension correlating most strongly with the individual items comprising it (Spearman correlation coefficients 0.62-0.94). Regurgitation items correlated strongly with the GERD dimension as expected, but the weaker correlation with heartburn items may have been due to the low prevalence of heartburn in the Shanghai population.
Factor analysis was used to explore whether the predicted factor structure of the questionnaire was supported. Credible construct validity was demonstrated for the RDQ, GERD Impact Scale and SF-36 questionnaires. All RDQ items correlated as expected in the factor analysis apart from the frequency and severity of 'pain behind breastbone', which correlated more strongly with the epigastric pain dimension than the heartburn dimension (Table 5).
The cumulative rate of the three factors was 72.1%. All GERD Impact Scale items correlated with factors as expected (Table 6). The cumulative rate of the four factors was 78.0%.
For SF-36, the cumulative rate of the eight factors plus health transition item was 71.3%. Most items correlated with factors as expected (see Table 7), with particularly high correlations seen for role-physical and bodily pain dimensions. The physical functioning (PF) items were distributed into two dimensions; PFa included moderate to vigorous activities such as lifting or carrying groceries, climbing several flights of stairs and walking more than one mile, whereas PFb included less strenuous activities such as climbing one flight of stairs, bending, kneeling, walking one or several blocks, and bathing or dressing oneself. The social function dimension was unclear, distributing to mental health and role-emotional dimensions. In addition, two items from the vitality dimension, two from the mental health dimension and one from the physical functioning dimension were distributed into the general health dimension. The three role-emotional items showed a tendency towards distribution into the role-physical dimension, although the correlation coefficients were lower than those for distribution into the expected role-emotional dimension.
The factor analysis showed that the construct validity of QOLRAD was not as good as expected, as items were not distributed to the appropriate dimensions (Table 8).

Convergent validity
The RDQ GERD score was negatively correlated with all QOLRAD dimensions; correlations were statistically significant for the QOLRAD dimensions of food and drink problems and social functioning. It was also significantly negatively correlated with all dimensions of the SF-36, correlating most strongly with bodily pain (the SF-36 dimension most impaired by GERD in previous studies), reflecting the fact that GERD is primarily a painful disease. All eight SF-36 dimensions were significantly correlated with the QOLRAD total score (p ≤ 0.001; correlation coefficients ranged from 0.16 to 0.29), supporting the construct validity of QOLRAD and SF-36.

Discussion
This pilot study used several well-designed questionnaires, administered together, with the aim of developing and validating a methodology for the epidemiological study of GERD in China. Using a randomized, stratified, multi-stage cluster sampling technique, we validated Chinese translations of the SF-36, QOLRAD questionnaire, GERD Impact Scale and RDQ. In this study, the translated and adapted questionnaires demonstrated reproducibility and internal consistency within the methodology adopted, although responsiveness was not assessed. Each questionnaire had a test-retest reliability coefficient larger than 0.7 and a high Cronbach's alpha coefficient (≥ 0.8), suggesting good reliability. The construct validity of questionnaires was also credible in this survey, although the QOLRAD did not perform well in the factor analysis. This was likely to be due to linguistic and cultural translation problems: facilitators considered that some items were difficult to explain to respondents, particularly for those with a low level of education.
The sampling and administration techniques contributed substantially to the success of this study. By gaining the support of local residential communities, a high response rate of 86% was achieved, which is likely to prevent significant responder bias. The provision of assistance from trained facilitators helped avoid potential cultural and linguistic confusion, providing a relatively precise interpretation of the items in the questionnaire, and is recommended for future epidemiological studies using this survey instrument in order to ensure accuracy.
Chinese translations of the SF-36 have previously undergone psychometric validation among Chinese-speaking peoples in mainland China, the USA, Hong Kong and Taiwan [29,38-41]. These studies demonstrated satisfactory psychometric characteristics for SF-36 in these groups, while highlighting a level of cultural variation between Western and Chinese versions and between the different Chinese cultures. There is a tendency, also reflected in the current study, for the social functioning dimension to perform less well in China [29]; Li and colleagues have commented that this points to the Confucian ideology of collectivism in China, where it is socially unacceptable for Chinese to use 'sickness' as an excuse to avoid working or socializing [29]. In several previous studies vitality was more strongly associated with mental health than physical health [29,38-40], which may relate to traditional Chinese medicine, where fatigue associated with depression is conceptualized as a deficiency of vital energy or 'qi'. Although this was not the case in the current study, two items in the vitality dimension were more strongly distributed to general health. These issues illustrate the importance of examining the psychometric validity of instruments in different ethnic groups with cultural differences in language, values and perceptions of health.
This study has several limitations. Some subjects found the combined questionnaire too long and repetitive: a general information questionnaire, the RDQ, GERD Impact Scale, QOLRAD and SF-36 combined to make a total of 137 items and, on average, the questionnaire took about 20 minutes to complete. Responsiveness to change and known-groups validity were not assessed. Where construct validity was assessed, the different recall periods for individual questionnaires may have weakened convergent correlation results, while the short retest period may distort the reliability analysis where respondents remember their previous responses. The methodology was unable to sample migrant workers, who make up a significant portion of the Shanghai population, as they remain officially registered in their place of origin.

Conclusion
The experience gained in this pilot study will inform a planned larger study of the epidemiology of GERD across mainland China, which will establish the wider prevalence of GERD symptoms in China using representative study populations and a standardized, well-validated methodology. The survey questionnaire will be reduced in length and simplified, and symptoms will be assessed using the RDQ with a longer recall period (4 weeks). The QOLRAD questionnaire will be removed from the survey, due to its relatively poor performance in the factor analysis. Ideally, responsiveness to change and known-groups validity should be studied to investigate further the validity of the survey instruments. Health-related quality of life will be evaluated using the SF-36, and sleep disturbance will be investigated using the Epworth Sleepiness Scale (ESS). Endoscopic examination of randomly sampled subjects would also be informative, to allow comparison with recent studies conducted in the West [42,43].
In summary, this study developed and tested a successful survey methodology for the epidemiological study of GERD in China. The questionnaires used demonstrated credible reliability and construct validity, supporting their use in larger epidemiological surveys of GERD in China, and allowing the results of this study to be extrapolated to the general population of East China.

Authors' contributions
YC and XY participated in the acquisition of data, analysis and interpretation of data, and drafting the article. XQM and RW participated in the analysis and interpretation of data, and drafting and critically revising the article. SJ and MAW participated in the conception and design of the study, and critically revising the article. JH made substantial contributions to the conception and design of the study, supervised all aspects of its implementation, and critically revised the article. All authors read and approved the final manuscript.