Inter-observer agreement in the assessment of endoscopic findings in ulcerative colitis

Background Endoscopic findings are essential in evaluating disease activity in ulcerative colitis. The aim of this study was to evaluate how endoscopists assess individual endoscopic features of mucosal inflammation in ulcerative colitis, the inter-observer agreement, and the importance of the observers' experience.
Methods Five video clips of ulcerative colitis were shown to a group of experienced and a group of inexperienced endoscopists. Both groups were asked to assess eight endoscopic features and the overall mucosal inflammation on a visual analogue scale. The following statistical analyses were used: contingency table analysis, kappa analysis, analysis of variance, Pearson linear correlation analysis, general linear models, and agreement analysis. All tests were two-tailed, with a significance level of 5%.
Results The inter-observer agreement ranged from very good to moderate in the experienced group and from very good to fair in the inexperienced group. Inter-observer agreement was significantly better in the experienced group in the rating of 6 out of 9 features (p < 0.05). The experienced and inexperienced endoscopists scored "ulcerations" significantly differently (p = 0.05). The inter-observer variation of the mean score of "erosions", "ulcerations" and the endoscopic activity index (EAI) in mild disease, and of the scoring of "erythema" and "oedema" in moderate-severe disease, was significantly higher in the inexperienced group. A correlation was seen between all the observed endoscopic features in both groups of endoscopists. Among experienced endoscopists, a set of four endoscopic variables ("Vascular pattern", "Erosions", "Ulcerations" and "Friability") explained 92% of the variation in EAI. By including "Granularity" in this set, 91% of the variation in EAI was explained in the group of inexperienced endoscopists.
Conclusion The inter-observer agreement in the rating of endoscopic features characterising ulcerative colitis is satisfactory in both groups of endoscopists but significantly higher in the experienced group. The difference in the mean score between the two groups is only significant for "ulcerations". The endoscopic variables "Vascular pattern", "Erosions", "Ulcerations" and "Friability" explained the overall endoscopic activity index. Even though the present result is quite satisfactory, there is potential for improvement. Improved grading systems might help improve the consistency of endoscopic descriptions.


Background
Ulcerative colitis (UC) is one of the major challenges in gastroenterology practice, and endoscopic findings are essential in evaluating disease activity [1,2]. Nevertheless, our previous work demonstrated that the detailed description of mucosal inflammation in endoscopy reports is quite often unsatisfactory [3]. The severity of mucosal inflammation in UC is usually evaluated with Baron's endoscopic activity index (EAI), introduced and validated forty years ago [4]. Later minor modifications have never been validated [5][6][7]. This index assesses the four items "vascular pattern", "friability", "erosions" and "ulcerations" [7], but "erythema", "oedema", "granularity", and "blood in lumen" are also signs of mucosal inflammation. However, it is not clear to what extent these different endoscopic features contribute to the overall assessment of the EAI [4].
Endoscopic findings are most frequently assessed on fixed-point scales, or simply described by dichotomous variables (present/absent) [8]. However, endoscopic features of mucosal inflammation are continuous variables, which is a potential drawback of discrete scoring scales. Previous studies have shown benefits of a visual analogue scale (VAS) in the assessment of mucosal lesions, including improved study power [9][10][11][12].
The aim of this study was to evaluate how endoscopists assess signs of mucosal inflammation in ulcerative colitis, the inter-observer agreement, the variance of the mean score, and the influence of the observers' experience, as well as the correlation between the eight endoscopic features and their individual contribution to the EAI.

Methods
Five patients presenting with varying degrees of ulcerative colitis were admitted for endoscopic examination. The examinations were videotaped with a Sony DV CAM DSR-20 MDP digital video recorder, permitting lossless editing and copying. The videos were edited, and a clip of about 30 s from each patient was shown by means of a high-resolution video projector to an audience of endoscopists. Fifteen experienced endoscopists (more than 750 colonoscopies) and 21 inexperienced endoscopists (fewer than 200 colonoscopies) were asked to assess eight endoscopic signs of colitis and to give an overall score of the mucosal inflammation on a VAS according to the EAI (Figure 1, Table 1). For each clip, the mean EAI of the experienced group was used as the "gold standard". To determine the inter-observer agreement, the observers' scores for the five video clips were inter-rated by transforming each VAS score to an ordinal value ranging from 1 to 5.
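The transformation of a VAS score to an ordinal value can be illustrated with a minimal sketch. The text does not state the length of the VAS or the cut points used, so the 0-10 scale and the five equal-width bands below are assumptions made purely for illustration, and the function name `vas_to_ordinal` is hypothetical.

```python
def vas_to_ordinal(score, vas_max=10.0, n_levels=5):
    """Map a VAS score to an ordinal level from 1 to n_levels.

    Assumptions (not taken from the study): the VAS runs from 0 to
    vas_max and is cut into n_levels bands of equal width.
    """
    if not 0.0 <= score <= vas_max:
        raise ValueError("score lies outside the assumed VAS range")
    band_width = vas_max / n_levels
    level = int(score // band_width) + 1
    # The top of the scale belongs to the highest level, not to a sixth one.
    return min(level, n_levels)


if __name__ == "__main__":
    for vas in (0.0, 1.9, 2.0, 7.7, 8.9, 10.0):
        print(vas, "->", vas_to_ordinal(vas))
```

With other cut points, for example unequal bands, only the computation of `level` would change.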

Statistical analysis
All continuously distributed variables are expressed as mean values with standard deviations and 95% confidence intervals [13]. All tests were two-tailed, with a significance level of 5%. The inter-observer agreement was analysed with the following procedure. The inter-rated scores of the 5 video clips for all observers were compared to the "Gold Standard" and expressed in contingency tables, and the inter-observer agreement was presented as a κ-value [14]. The strength of agreement is very good for a κ-value of 0.81-1.00, good for 0.61-0.80, moderate for 0.41-0.60 and fair for 0.21-0.40 [13]. Kappa analysis was used to compare the agreement of experienced and inexperienced observers [14].
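As an illustration of the kappa procedure, the sketch below computes an unweighted Cohen's κ for one observer's ordinal ratings against a reference rating and labels the result with the bands quoted above. Several points are assumptions rather than details from the study: whether κ was weighted is not stated, the label "poor" for values of 0.20 or below is not given in the text, and the function names and example ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two equally long lists of categories."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from the marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


def strength_of_agreement(kappa):
    """Label a kappa value using the bands quoted in the text."""
    if kappa > 0.80:
        return "very good"
    if kappa > 0.60:
        return "good"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "poor"  # assumption: this band is not named in the text


if __name__ == "__main__":
    observer = [1, 2, 3, 4, 5]       # hypothetical ordinal ratings
    gold_standard = [1, 2, 3, 4, 4]  # hypothetical reference ratings
    k = cohen_kappa(observer, gold_standard)
    print(round(k, 2), strength_of_agreement(k))
```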
Observers' scores were compared using analysis of variance (ANOVA) [15]. The coefficient of variance, normally expressed as the ratio SD/mean, is inappropriate for comparing scores that span a wide range. The coefficient was therefore modified and expressed as the ratio SD/MP (MP = midpoint of the scale).
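The effect of dividing by the midpoint of the scale rather than by the mean can be shown with a small sketch. The 0-10 scale, the use of the sample standard deviation, the function names and the example scores are all assumptions for illustration only.

```python
from statistics import mean, stdev


def conventional_cv(scores):
    """Conventional coefficient: sample SD divided by the mean."""
    return stdev(scores) / mean(scores)


def modified_cv(scores, scale_min=0.0, scale_max=10.0):
    """Modified coefficient: sample SD divided by the scale midpoint (MP).

    Assumption: the scale runs from scale_min to scale_max.
    """
    midpoint = (scale_min + scale_max) / 2.0
    return stdev(scores) / midpoint


if __name__ == "__main__":
    low_scores = [0.5, 1.0, 1.5]   # hypothetical scores near the bottom
    high_scores = [8.0, 8.5, 9.0]  # same spread near the top of the scale
    for label, scores in (("low", low_scores), ("high", high_scores)):
        print(label, conventional_cv(scores), modified_cv(scores))
```

Both hypothetical sets have the same spread, so the modified coefficient is identical for them, whereas the conventional coefficient is much larger for the low scores simply because their mean is small.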
Pearson linear correlation analysis was performed to study the correlation pattern between variables within groups of observers. General linear models (GLM) were used to express the EAI as a function of the set of independent variables [15].
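A minimal sketch of these two steps follows: Pearson's r between two variables, and an ordinary least-squares fit whose R² corresponds to the proportion of variation in the response that a set of variables explains. This is a plain main-effects linear model written in pure Python, not the software or the exact model specification used in the study, and all names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson linear correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_model(predictors, response):
    """Least-squares fit with intercept; returns (coefficients, R squared)."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, response)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(c * v for c, v in zip(coefs, r)) for r in rows]
    mean_y = sum(response) / len(response)
    ss_res = sum((y - f) ** 2 for y, f in zip(response, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in response)
    return coefs, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical scores for two features and an overall index.
    features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
    overall = [4.2, 5.3, 9.1, 10.4, 13.6]
    coefs, r2 = fit_linear_model(features, overall)
    print(coefs, r2)
```

Interaction terms could be represented by adding product columns to `predictors` before fitting.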
To study the agreement between the EAI and the Gold Standard within groups of observers, agreement analysis was used [13,16]. The two sets of measurements were compared using Student's t-test for paired samples [13]. The EAI and the Gold Standard were plotted against each other and Pearson linear regression analysis was performed [13]. The intercept was tested against 0 and the regression coefficient against 1 using the studified test methodology [15]. The agreement limits and the agreement index were calculated [13,15].
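The agreement limits can be sketched as follows, assuming they follow the common convention of the mean difference plus or minus 1.96 standard deviations of the differences, and assuming that an outlier is a difference falling outside those limits. Neither definition is spelled out in the text, and the agreement index is not defined there, so it is not computed. The function name and data are hypothetical.

```python
from statistics import mean, stdev


def agreement_limits(measured, reference, z=1.96):
    """Mean difference, limits of agreement and percentage of outliers.

    Assumptions: limits are bias +/- z * SD of the paired differences,
    and outliers are differences lying outside those limits.
    """
    if len(measured) != len(reference) or len(measured) < 2:
        raise ValueError("need at least two paired measurements")
    diffs = [m - r for m, r in zip(measured, reference)]
    bias = mean(diffs)
    half_width = z * stdev(diffs)
    lower, upper = bias - half_width, bias + half_width
    outside = sum(1 for d in diffs if d < lower or d > upper)
    return {
        "bias": bias,
        "lower": lower,
        "upper": upper,
        "outlier_percent": 100.0 * outside / len(diffs),
    }


if __name__ == "__main__":
    observed_eai = [2.0, 4.0, 6.0, 8.0]   # hypothetical observer scores
    gold_standard = [1.0, 5.0, 5.0, 9.0]  # hypothetical reference scores
    print(agreement_limits(observed_eai, gold_standard))
```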

Inter-observer agreement
The observers' ratings of the endoscopic findings are presented in a contingency table (Table 2). The kappa analysis showed that the inter-observer agreement ranged from very good to moderate in the experienced group and from very good to fair in the inexperienced group. In the experienced group it was very good for the items "erosions", "ulcerations", "friability" and "EAI", good for the findings "erythema", "oedema", "granularity" and "blood in lumen", and moderate for "vascular pattern". In the inexperienced group the agreement was very good for the item "friability", good for the items "ulceration" and "EAI", moderate for the items "erythema", "oedema", "granularity", "blood in lumen" and "erosions", and only fair for the item "vascular pattern" (Table 3).

Mean score
Normal mucosa
There was no significant difference between experienced and inexperienced endoscopists in the mean score of any endoscopic finding, but the coefficient of variance of the mean score was significantly higher (p < 0.01) in the inexperienced group when scoring "erythema" (Table 4).

Mucosal inflammation
A significant difference in the observed mean score was only seen in the assessment of "ulcerations" in moderate-severe and severe disease (p = 0.05). The coefficient of variance was significantly higher in the inexperienced group for the scoring of "erosions", "ulcerations", "blood in lumen" and "EAI" in mild disease (p < 0.01), for the scoring of "vascular pattern" and "blood in lumen" in mild-moderate disease (p < 0.01), for the scoring of "vascular pattern", "erythema" and "oedema" in moderate-severe disease (p < 0.01, p < 0.01 and p = 0.04, respectively), and for the scoring of "vascular pattern", "erythema", "erosions" and "friability" in severe disease (p = 0.04, p < 0.01, p < 0.01 and p = 0.01, respectively) (Table 4).

Inter-correlation between endoscopic findings and the EAI
Significant correlation was found between all the observed endoscopic variables in both groups of endoscopists (Table 5). In the group of experienced endoscopists, the set of endoscopic variables "Vascular pattern", "Erosions", "Ulcerations" and "Friability" linearly explained 92% of the variation in EAI. When interactions between the observed endoscopic variables were included, the explained variation increased to 97%. "Erosions", with its interactions, was found to be the most important endoscopic variable explaining the variation in EAI among the experienced endoscopists.
By including "Granularity" in the set, 91% of the variation in EAI was explained in the inexperienced group. This explanation increased to 95% by including interactions between the observed variables.

Correlation between the observed EAI and the "Gold standard"
When the EAI was compared to the "Gold standard", no significant difference was detected between the two groups of endoscopists (Table 6). However, the linear relationship between the two variables was found to differ significantly (p < 0.01) from the line of equality (Y = X) in the group of inexperienced endoscopists (Fig. 2a). No such difference was detected for the experienced endoscopists (Fig. 2b). The agreement index was fairly good in the inexperienced group compared to excellent for the experienced endoscopists, and the percentage of outliers was 6.7% in the inexperienced group compared to 5.3% in the experienced group (Fig. 2c and 2d, Table 6). No significant correlation was found between the mean and the absolute difference of EAI and "Gold Standard".

Discussion
This study demonstrated an overall satisfactory inter-observer agreement in rating the endoscopic lesions, with significantly better agreement in the experienced group. Between experienced and inexperienced endoscopists, the only significant difference observed was in the mean score of "ulcerations", but the inter-observer variation of the mean score was more pronounced in the inexperienced group, particularly in the assessment of mild and moderate disease.
These results are to some extent in contrast to earlier studies. The study of Baron et al., from the era of fibreoptic instruments, had the advantage that the endoscopic findings were assessed live in 60 patients by three different observers [4]. Nevertheless, that study was of limited value because not all types of mucosal lesions were present and only three observers, who had worked as a team for 20 months prior to the study, were involved. The various endoscopic lesions showed a variable degree of observer agreement, and the only feature with good agreement was friability. No attempt was made to correlate the endoscopic findings with either histological or clinical severity. On the basis of their results they proposed an endoscopic grading of "activity" in ulcerative colitis that is still in use, but further evaluation of this grading has not been performed. Orlandi et al. demonstrated variable inter-observer agreement when 46 still images of ulcerative colitis were evaluated by four experienced and 11 inexperienced endoscopists [17]. Minoli et al. emphasised the importance of common recognition and definition of endoscopic findings. Prior to their study the observers were briefed to agree on the definition of different endoscopic signs, and the study was only started when agreement on the definition of lesions had been reached. They found good concordance in the assessment of 12 out of 16 findings in IBD [18]. Some of the terms and their attributes in these studies are either not currently used or do not conform to the Minimal Standard Terminology version 2.0 (MST) approved by most of the international endoscopy organisations (OMED, ESGE, ASGE, JSGE) [19].
The differences between the present study and the previous ones might be explained by differences in study design and analysis. None of the earlier studies related the agreement and the variance to the degree of inflammation, and both the method of presenting the images and the method of assessing endoscopic findings differed.
The difference in mean score and in coefficient of variance was lowest in the assessment of the EAI; this might partially be due to the four-step score normal-mild-moderate-severe. An identical grading of "erythema", "oedema", and "granularity" might reduce the differences between experienced and inexperienced endoscopists as well as the inter-observer variation of the mean score. This grading was not used because it is not a part of the MST, but it should probably be introduced as attribute values for these items. The VAS might be less appropriate for the scoring of countable lesions like ulcers and erosions, and for sensitive features of mucosal inflammation that are present in nearly any degree of inflammation, like absence of vascular pattern, which would explain the narrow score range from 7.7 in mild disease to 8.9 in severe disease. This might bias the inter-rating of the "vascular pattern", reduce the inter-observer agreement, and explain why the present study showed a lower kappa value for "vascular pattern" than that observed by Orlandi et al. The inter-observer agreement for all other endoscopic signs was superior in the present study.
There was a good correlation between the grading of all endoscopic features. In the experienced group the four features "vascular pattern", "friability", "erosions", and "ulcerations" explained the EAI, but changing the attribute values of other features might affect this result. Even though granularity had to be added to explain the EAI in the inexperienced group, we still assume that the first four items are sufficient to describe inflammatory changes of the mucosa.
Inflammatory mucosal changes present in other GI diseases are also graded, e.g. oesophagitis with the LA-classification [20]. However, the LA-classification, which evaluates the extent of mucosal breaks in a short segment, is quite different from the more complex scoring of diffuse mucosal inflammation in UC, which is characterised by at least 4 different endoscopic features. This might explain why Lundell et al. did not demonstrate any difference between experienced and inexperienced endoscopists [20].

Figure 2. Regression analysis and agreement plots for experienced and inexperienced endoscopists.
The "Gold Standard" was chosen to be the mean EAI of the experienced endoscopists because one of the main purposes was to assess the influence of experience.
Technical factors, such as how well the CCD chip reproduces the natural light spectrum, might contribute to the variance in the grading of erythema [21].
The live examination is of course the gold standard for the evaluation of the colonoscopy, with the possibility to re-examine any suspicious lesion, but it is not practically feasible to show the same endoscopy to a sufficient number of observers at the same time. The live situation is probably best imitated by reviewing high-quality videotapes, with the possibility for the observer to play and rewind the recording, which permits more detailed examination of the mucosa. Friability might be easier to assess if the intubation and the withdrawal of the scope are recorded.
In routine examinations, the problem of variance in assessing endoscopic findings might be reduced by systematic image documentation of gastrointestinal endoscopies, as recommended by ESGE [22] (Fig. 3).

Conclusions
The inter-observer agreement in the rating of endoscopic features characterising ulcerative colitis is satisfactory in both groups of endoscopists but significantly higher in the experienced group. The difference in the mean score between the two groups is only significant for "ulcerations". However, the variance of the mean score might influence the follow-up of patients and bias the results of studies, especially of mild and moderate ulcerative colitis. The endoscopic variables "Vascular pattern", "Erosions", "Ulcerations" and "Friability" explained the overall endoscopic activity index. Even though the present result is quite satisfactory, there is clear potential for improvement. Improved grading systems might help improve the consistency of endoscopic descriptions.