A comparison of Child-Pugh, APACHE II and APACHE III scoring systems in predicting hospital mortality of patients with liver cirrhosis

Background: The aim of this study was to assess the prognostic accuracy of the Child-Pugh, APACHE II and APACHE III scoring systems in predicting short-term hospital mortality of patients with liver cirrhosis.
Methods: Two hundred admissions of 147 cirrhotic patients (44% viral-associated liver cirrhosis, 33% alcoholic, 18.5% cryptogenic, 4.5% both viral and alcoholic) were studied prospectively. Clinical and laboratory data required for the Child-Pugh, APACHE II and APACHE III scores were recorded on day 1 for all patients. Discrimination was evaluated using receiver operating characteristic (ROC) curves and the area under the ROC curve (AUC). Calibration was estimated using the Hosmer-Lemeshow goodness-of-fit test.
Results: Overall mortality was 11.5%. The mean Child-Pugh, APACHE II and APACHE III scores of survivors were significantly lower than those of nonsurvivors. Discrimination was excellent for the Child-Pugh (ROC AUC: 0.859) and APACHE III (ROC AUC: 0.816) scores, and acceptable for the APACHE II score (ROC AUC: 0.759). Although the Hosmer-Lemeshow statistic revealed adequate goodness-of-fit for the Child-Pugh score (P = 0.192), this was not the case for the APACHE II and III scores (P = 0.004 and 0.003, respectively).
Conclusion: Our results indicate that, of the three models, the Child-Pugh score had the least statistically significant discrepancy between predicted and observed mortality across the strata of increasing predicted mortality. This supports the hypothesis that APACHE scores do not work accurately outside ICU settings.


Background
The recognition of risk factors that can stratify a population of cirrhotic patients into subgroups with different survival is of great prognostic value for the clinician. Numerous attempts have been made to develop a reliable prognostic survival model for cirrhosis. The target population of the different scoring systems in the literature covers patients with liver cirrhosis [1][2][3][4][5][6][7], alcoholic liver disease [8,9], variceal bleeding [10][11][12][13][14][15][16][17], and upper gastrointestinal bleeding including variceal bleeding [18][19][20]. The Child-Turcotte classification [1] and its subsequent modification by Pugh [10] are old empiric methods to assess hepatocellular functional reserve in candidates for portosystemic shunting. Although Child-Turcotte and Child-Pugh scores (CPS) have not been formally evaluated for their statistical accuracy, they have been useful for risk-stratifying groups of patients with cirrhosis [21][22][23], for assessing the efficacy of interventional procedures such as transjugular intrahepatic portosystemic shunting [24,25] or sclerotherapy [17,26], and for evaluating therapy for complications of cirrhosis [27][28][29]. Although CPS score is considered an adequate method to establish the degree of liver failure and the survival probability [30], two of its elements are very subjective (ascites and encephalopathy), and a further limitation is its limited discriminatory ability [7]. In some studies, the prognostic value of CPS is described as incomplete, and other variables are demonstrated to have prognostic significance [31]. In addition, prognostic factors unrelated to hepatic function (cardiac, renal, pulmonary, acid-base and electrolyte status, other important associated comorbid conditions and factors) are not included.
The Acute Physiology, Age and Chronic Health Evaluation (APACHE) II and III scores were developed by Knaus et al in 1985 and 1991, respectively [32,33], and are used mainly for critically ill patients of all disease categories admitted to intensive care units (ICUs). They differ in how chronic health status is assessed, in the number of physiologic variables included (12 vs. 17), and in the total score. Specific parameters of liver function (i.e. serum bilirubin and albumin) are included only in the APACHE III scoring system. Some prognostic variables (e.g. prothrombin time) and other indicators of response to therapy (e.g. blood units transfused), which are known to be important outcome predictors in cirrhotic patients, are not measured by the acute physiology scores [17,21,22,26]. APACHE II and III scores have been successfully used to risk-stratify cirrhotic patients admitted to medical ICUs [34][35][36][37][38]. APACHE II has previously been used to risk-stratify a mixed population of ICU and non-ICU cirrhotic patients with upper gastrointestinal bleeding [39], while recently an incomplete APACHE III score (i.e. a score in which data for blood gas analysis were omitted) has been reported to be superior to CPS in risk-stratifying cirrhotic patients outside ICU settings [40].
The aim of the present study was to compare the prognostic accuracy of Child-Pugh, 24 hour APACHE II and complete 24 hour APACHE III scoring systems in predicting hospital mortality of patients with liver cirrhosis admitted to a gastroenterological medical ward.

Methods
This prospective study included two hundred consecutive hospitalizations of 147 patients with liver cirrhosis admitted to the Department of Gastroenterology of the University Hospital of Heraklion, from February 1999 through January 2001. For the purpose of the study, each admission was considered as one patient. The criterion for inclusion was the presence, at admission or in the past history, of any of the major complications of cirrhosis (ascites, encephalopathy, variceal bleeding or spontaneous bacterial peritonitis). Patients transferred from elsewhere were included in the study only if the transfer occurred within 24 hours after initial admission. Patients with hepatocellular carcinoma and patients admitted for less than one day were excluded. Patients admitted to a medical ICU during the first 24 hours of their presentation were also excluded. The diagnosis of cirrhosis was based on liver biopsy in 93 out of 200 patient admissions (46.5%). For the remaining 107 patient admissions, the diagnosis of cirrhosis was based on clinical, laboratory and radiological criteria: history of portal hypertension excluding other etiologies, evidence of esophageal varices confirmed by endoscopy, splenomegaly, ascites confirmed by abdominal ultrasound and physical examination, impaired liver function tests and clotting profile, and ultrasound or computed tomography criteria [39,41].
To calculate the APACHE II score [32], twelve common physiological and laboratory values (temperature, mean arterial pressure, heart rate, respiratory rate, oxygenation (PaO2 or A-aDO2), arterial pH, serum sodium, serum potassium, serum creatinine, haematocrit, white blood cell count and Glasgow coma score) are marked from 0 to 4, with 0 being normal and 4 the most abnormal. The sum of these marks is added to a mark adjusting for patient age and a mark adjusting for chronic health problems (severe organ insufficiency or immunocompromised patients) to arrive at the APACHE II score.
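The assembly of the total described above can be sketched in a few lines. This is a minimal illustration and not the published scoring tables: the per-variable marks, the age mark and the chronic health mark are assumed to have been looked up already in the original APACHE II article [32], and the names `APS_VARIABLES` and `apache_ii_total` are hypothetical.

```python
from typing import Mapping

# The twelve acute physiology variables listed in the text
# (identifier names are illustrative, not taken from any library).
APS_VARIABLES = (
    "temperature", "mean_arterial_pressure", "heart_rate", "respiratory_rate",
    "oxygenation", "arterial_ph", "serum_sodium", "serum_potassium",
    "serum_creatinine", "haematocrit", "white_blood_cell_count",
    "glasgow_coma_score",
)


def apache_ii_total(marks: Mapping[str, int], age_mark: int,
                    chronic_health_mark: int) -> int:
    """Sum the twelve physiology marks, then add the age and chronic health marks."""
    missing = [name for name in APS_VARIABLES if name not in marks]
    if missing:
        raise ValueError("missing marks for: " + ", ".join(missing))
    acute_physiology = sum(marks[name] for name in APS_VARIABLES)
    return acute_physiology + age_mark + chronic_health_mark


# Made-up marks for a hypothetical admission (not study data).
example_marks = {name: 0 for name in APS_VARIABLES}
example_marks["heart_rate"] = 2
example_marks["serum_sodium"] = 1
example_total = apache_ii_total(example_marks, age_mark=3, chronic_health_mark=2)
```

Keeping the table look-up separate from the summation means that a missing variable is reported explicitly instead of being silently counted as normal.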
APACHE III scores range from 0 to 299 and are derived from marks for the extent of abnormality of 17 physiologic measurements (the acute physiology score), an adjustment for age, and an adjustment for seven comorbidities that reduce immune function and influence hospital survival [33]. The 17 physiological variables include eleven laboratory parameters (haematocrit, white blood cell count, serum creatinine, serum BUN, serum sodium, serum albumin, serum bilirubin, blood glucose, PaO2, A-aDO2, and a scoring for acid-base abnormalities), five vital signs (pulse, mean blood pressure, temperature, respiratory rate, urine output) and a modified Glasgow coma score.
Clinical and laboratory data necessary to the CPS and APACHE systems and prothrombin time (PT) values were recorded on the first day for all patients. Physiological data (temperature, heart rate, mean blood pressure and respiratory rate) were recorded 3-hourly during the first 24 hours of admission. The calculation of APACHE II and III scores was based on the worst values taken during the first 24 hours after admission.
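Selecting the "worst" value from the 3-hourly readings can be sketched as below. The sketch assumes that "worst" means the reading that attracts the most points under the scoring table, which may be an abnormally low value as well as a high one. The function `toy_heart_rate_points` is a hypothetical banding used only for illustration; it is not the published APACHE table, and the readings are made up.

```python
from typing import Callable, Sequence


def worst_reading(readings: Sequence[float],
                  points_for: Callable[[float], int]) -> float:
    """Return the reading that scores the most points (first one on ties)."""
    if not readings:
        raise ValueError("at least one reading is required")
    return max(readings, key=points_for)


def toy_heart_rate_points(bpm: float) -> int:
    """Hypothetical banding for illustration only, NOT the APACHE table."""
    if 70 <= bpm <= 109:
        return 0
    if 55 <= bpm < 70 or 109 < bpm <= 139:
        return 2
    return 4


# Eight made-up 3-hourly heart rates covering the first 24 hours.
three_hourly_heart_rates = [88, 96, 104, 146, 98, 92, 90, 84]
worst = worst_reading(three_hourly_heart_rates, toy_heart_rate_points)
worst_points = toy_heart_rate_points(worst)
```

Ranking by points and not by raw magnitude matters for variables that are abnormal in both directions: with the toy banding, a heart rate of 50 would be selected over one of 100.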

Statistical analysis
Chi-square test was used to assess the differences of mortality within Child-Pugh classes A, B, and C. Individual relationship of each score (CPS, APACHE II, APACHE III) and PT values to the risk of death was assessed by t-test. For the assessment of the magnitude of correlation of length of stay (LOS) with CPS, APACHE II and APACHE III, Pearson correlation was used. Descriptive statistics were expressed as mean ± SD unless otherwise stated. Discrimination was tested using the receiver operating characteristic (ROC) curves and by comparing areas under the curve (AUCs) [42]. AUCs between 0.7 and 0.8 were classified as "acceptable" and between 0.8 and 0.9 as "excellent" discrimination [43]. For the different scoring systems tested, the sensitivity, specificity, overall correctness of prediction, positive and negative predictive values were calculated, and the cutoff point giving the best Youden index was determined [44]. This cutoff point was also used to calculate the predicted and observed outcome for patients. In order to test the overall classification accuracy of APACHE III score in association with PT, we applied discriminant analysis (backward stepwise method). A P value less than 0.05 was considered statistically significant for all above analyses. Calibration was assessed using the Hosmer-Lemeshow goodness of fit statistic which divides subjects into deciles based on predicted probabilities of death and then computes a chi-square from observed and expected frequencies [45]. Lower chi-square values and higher P values are associated with a better fit. A good fit was defined as P > 0.05.
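The discrimination measures named above can be illustrated with a short, dependency-free sketch. It computes the AUC as the proportion of (non-survivor, survivor) pairs in which the non-survivor has the higher score, and scans the observed scores for the cutoff with the largest Youden index (sensitivity + specificity - 1). The function names and the toy data are assumptions for illustration; the software actually used in the study is not stated.

```python
from typing import Sequence, Tuple


def roc_auc(scores: Sequence[float], died: Sequence[bool]) -> float:
    """AUC via pairwise comparison; tied pairs count as one half."""
    pos = [s for s, d in zip(scores, died) if d]
    neg = [s for s, d in zip(scores, died) if not d]
    if not pos or not neg:
        raise ValueError("both survivors and non-survivors are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_youden_cutoff(scores: Sequence[float],
                       died: Sequence[bool]) -> Tuple[float, float]:
    """Return (cutoff, J), predicting death when score >= cutoff."""
    n_pos = sum(1 for d in died if d)
    n_neg = len(died) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both survivors and non-survivors are required")
    best_cutoff, best_j = float("nan"), -1.0
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, d in zip(scores, died) if d and s >= cutoff)
        true_neg = sum(1 for s, d in zip(scores, died) if not d and s < cutoff)
        j = true_pos / n_pos + true_neg / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy data (not study data): eight admissions, three deaths.
toy_scores = [5, 6, 7, 8, 9, 10, 11, 12]
toy_died = [False, False, False, False, True, False, True, True]
toy_auc = roc_auc(toy_scores, toy_died)
toy_cutoff, toy_j = best_youden_cutoff(toy_scores, toy_died)
```

The pairwise form of the AUC is quadratic in the number of admissions, which is acceptable for a cohort of a few hundred; a rank-based computation would be preferable for large datasets.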
Results
Forty-nine cases (24.5%) were classified as Child-Pugh class A, 88 cases (44%) as class B and 63 cases (31.5%) as class C. No deaths were recorded among patients with Child-Pugh class A. Two patients with Child-Pugh class B and 21 with class C died. Mortality increased significantly with increasing Child-Pugh class (P < 0.001). Table 1 shows that there were significant differences in CPS, APACHE II score, APACHE III score, and PT between survivors and non-survivors. Table 2 reports predictive values of the various scoring systems calculated at the cutoff point giving the best Youden index. ROC curves are shown in Figure 1. The discrimination power of CPS and APACHE III was excellent, while that of APACHE II was acceptable. When information regarding PT values was combined with the APACHE III score into a new discriminant function, the overall classification accuracy of APACHE III was not improved; thus PT was deleted from the full model (non-significant at the 5% level). The results of the Hosmer-Lemeshow goodness-of-fit tests are shown in Table 3, while deciles of risk are shown in Tables 4, 5 and 6. The Hosmer-Lemeshow statistic was best for CPS. However, for the two APACHE scores, calibration was poor.
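The class-wise mortality figures above are enough to reproduce a chi-square test by hand. The sketch below assumes a plain Pearson chi-square on the 3 x 2 table of Child-Pugh class by outcome; the text does not say whether this or a test for trend was used, so the statistic is illustrative and not a reproduction of the authors' calculation.

```python
import math

# Counts taken from the text: admissions and deaths per Child-Pugh class.
admissions = {"A": 49, "B": 88, "C": 63}
deaths = {"A": 0, "B": 2, "C": 21}

total_admissions = sum(admissions.values())
total_deaths = sum(deaths.values())
overall_mortality = total_deaths / total_admissions

chi_square = 0.0
for cls, n in admissions.items():
    expected_deaths = n * overall_mortality
    expected_survivors = n - expected_deaths
    observed_survivors = n - deaths[cls]
    # Each class contributes one cell for deaths and one for survivors.
    chi_square += (deaths[cls] - expected_deaths) ** 2 / expected_deaths
    chi_square += (observed_survivors - expected_survivors) ** 2 / expected_survivors

# With (3 - 1) * (2 - 1) = 2 degrees of freedom, the chi-square
# upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-chi_square / 2)
```

Under these assumptions the statistic is about 43.2 on 2 degrees of freedom, which is consistent with the reported P < 0.001.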
The median LOS for survivors was 9 days (range 2-85 days): 7 days (range 2-17 days) for patients with Child-Pugh class A, 9 days (range 2-48 days) for class B, and 15 days (range 2-85 days) for class C. CPS and APACHE III score correlated strongly with the duration of hospitalization (P < 0.001), while APACHE II score had a weak and non-significant correlation.

Discussion
The performance of prognostic models is evaluated by their discrimination and calibration. Discrimination (i.e. the ability of a prognostic score to classify patients correctly as survivors or non-survivors) is measured by the AUC [42,43]. Calibration evaluates the degree of correspondence between the estimated probabilities of mortality produced by a model and the actual mortality experience of patients, and can be tested using the Hosmer-Lemeshow goodness-of-fit statistic [45].
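A calibration check of this kind can be sketched as follows. The sketch assumes that each admission already has a predicted probability of death strictly between 0 and 1 (for example from a logistic model fitted on the score; the text does not state how these were obtained), groups admissions into equal-sized bins of increasing risk, and uses groups - 2 degrees of freedom. The function name is hypothetical, and the closed-form p-value used here is valid only for an even number of degrees of freedom, which covers the usual ten groups.

```python
import math
from typing import Sequence, Tuple


def hosmer_lemeshow(predicted: Sequence[float], died: Sequence[bool],
                    groups: int = 10) -> Tuple[float, float]:
    """Return (chi_square, p_value) with df = groups - 2."""
    if len(predicted) != len(died):
        raise ValueError("predicted and died must have the same length")
    if groups < 4 or groups % 2:
        raise ValueError("this sketch needs an even number of groups >= 4")
    if any(not 0.0 < p < 1.0 for p in predicted):
        raise ValueError("probabilities must lie strictly between 0 and 1")
    n = len(predicted)
    order = sorted(range(n), key=lambda i: predicted[i])
    chi_square = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        if not members:
            raise ValueError("too few subjects for this number of groups")
        size = len(members)
        expected = sum(predicted[i] for i in members)
        observed = sum(1 for i in members if died[i])
        mean_risk = expected / size
        chi_square += (observed - expected) ** 2 / (size * mean_risk * (1 - mean_risk))
    # Chi-square upper tail for even df = 2m:
    # exp(-x/2) * sum_{k < m} (x/2)^k / k!
    half = chi_square / 2
    m = (groups - 2) // 2
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(m))
    return chi_square, p_value
```

A small statistic with P > 0.05 indicates that observed and expected deaths agree across the risk groups; a large statistic flags the kind of miscalibration reported here for the APACHE scores.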
In our series, discrimination was acceptable to excellent for the Child-Pugh and APACHE scores; however, both APACHE prognostic systems had inadequate goodness-of-fit for death. Our results for APACHE II and CPS discrimination compare well with those published by Afessa et al [39]. In their study, the prognostic value of APACHE II (AUC 0.78) was as good as that of the Child-Pugh score (AUC 0.76) in predicting the short-term outcome of 111 cirrhotic patients hospitalized for upper gastrointestinal bleeding, although no information on correct classification rates, sensitivity, specificity, cutoff values or goodness-of-fit was reported. The reported APACHE II mean [...]
Butt et al reported that, by using discriminant analysis, the APACHE III score correctly classified 75% of cases vs. 67% of cases for the Child-Pugh score [40]. No cutoff values were reported, the overall model calibration was not tested, and data from blood gas analysis were not included in the calculation of the APACHE III score, thus resulting in an incomplete score. The mean APACHE III values in that study were high for both survivors and non-survivors (58.9 ± 35.1 and 87.4 ± 30.3, respectively). This might be related to the high percentage of patients admitted with upper gastrointestinal tract bleeding (i.e. 57%). Since four out of five vital signs (pulse, mean blood pressure, respiratory rate, and possibly urine output) and some of the laboratory parameters (i.e. haematocrit, serum BUN, and possibly creatinine) that are needed to arrive at the APACHE III score are markedly affected by bleeding, this might be the reason for the higher scores observed. Furthermore, the authors did not specify whether they included patients admitted to a medical ICU during the first 24 hours of their admission, whereas patients with hepatocellular carcinoma were also included in the study.
The reported mortality on day 1 was 26% and 68% in patients with an APACHE III score of 51 to 75 points and greater than 75, respectively. It is noteworthy that in our series 17 out of 67 patients (25.3%) with an APACHE III score of 51 to 75 and 6 out of 11 patients (54.5%) with an APACHE III score greater than 75 also died. This suggests that, at least in this subgroup of sicker patients, our results compare well.

Figure 1 ROC curves for CPS, APACHE II and APACHE III scoring systems
There are many potential reasons for the insufficient calibration of the APACHE scores. Clinically useful predictive models should demonstrate ease of use, accuracy, reproducibility and acceptance by data-collecting staff [46]. Some variables of the APACHE scores (e.g. heart rate) depend on continuous monitoring. In addition, it has been shown that inter-observer variability is high when these scoring systems are not used on a regular basis (as in most non-ICU wards), thus affecting the accuracy and reproducibility of the data [47,48]. This is potentially relevant in our study, since physiological data collection was performed by several physicians and over a long period of time (24 months). As previously suggested [49], we tried to minimize variability by having one person coordinate the process of data collection and by keeping a written reference of definitions based on the original articles describing the APACHE scores.
Another potential reason for the inadequate calibration is the difference in the level of disease severity between our database and the development databases of the mortality prediction systems [50]. Statistically derived prediction models like the APACHE systems are calibrated to the [...] [34,35]. In our series, APACHE II scores equal to or greater than 17 and 22, and APACHE III scores equal to or greater than 75 and 80, were recorded in only 25 (12.5%), 5 (2.5%), 11 (5.5%) and 8 (4%) patients respectively, thus emphasizing the much lower level of disease severity in our patients. It should also be recognized that the wide 95% CIs of our AUCs (Table 2) suggest a sample size problem, especially as only 23 patients died.
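The effect of the small number of deaths on the precision of an AUC can be illustrated with the Hanley-McNeil approximation for its standard error. This is an assumption made for illustration: the text does not state how the intervals in Table 2 were computed, so the values produced below are not a reproduction of that table. The inputs are the AUCs reported in this study together with 23 deaths and 177 survivors.

```python
import math


def hanley_mcneil_se(auc: float, n_deaths: int, n_survivors: int) -> float:
    """Approximate standard error of an AUC (Hanley-McNeil formula)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_deaths - 1) * (q1 - auc ** 2)
        + (n_survivors - 1) * (q2 - auc ** 2)
    ) / (n_deaths * n_survivors)
    return math.sqrt(variance)


# AUCs as reported in the text; 23 deaths among 200 admissions.
reported_aucs = {"Child-Pugh": 0.859, "APACHE II": 0.759, "APACHE III": 0.816}

approx_intervals = {}
for name, auc in reported_aucs.items():
    se = hanley_mcneil_se(auc, n_deaths=23, n_survivors=177)
    # Normal-approximation 95% interval, capped at 1.
    approx_intervals[name] = (auc - 1.96 * se, min(1.0, auc + 1.96 * se))
```

Because the variance is divided by the product of the two group sizes, the 23 deaths dominate the uncertainty: a tenfold larger cohort with the same mortality would give a much narrower interval for the same AUC.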
Potential limitations of our study should also be mentioned. Our study was performed in an academic referral hospital; therefore our results may not be applicable to institutions with different patient populations. Because mathematical equations for APACHE III have not been published, and for APACHE II this equation is available only for admission, these equations have not been used to calculate the relative risk of death. In agreement with other studies [34,35,37,39,40], we wanted to test the accuracy of single-score values. Patients admitted to a medical ICU during the first 24 hours of their presentation were excluded from our study, thus resulting in a mortality rate of only 11.5%. It could be argued that excluding these patients weakens our study, since patients who are sicker at presentation are more likely to die. However, the physiological data included in the APACHE III score are recorded 3-hourly during the first 24 hours of admission, and the worst value in this time interval is taken into account to calculate the total score [33]. Furthermore, we aimed to identify, within a 24-hour interval, patients not sick enough to be admitted to a medical ICU but who are unlikely to benefit from standard therapy and for whom more intensive monitoring and treatment might be tried.

Conclusions
In conclusion, we cannot recommend the use of APACHE II and III scores in non-ICU patients. The present study showed that the discrimination power of CPS and APACHE III was excellent, while that of APACHE II was acceptable. Although the Hosmer-Lemeshow statistic revealed adequate goodness-of-fit for CPS, this was not the case for the APACHE II and III scores. Our results indicate that, among the three scores, CPS had the least statistically significant discrepancy between predicted and observed mortality across the strata of increasing predicted mortality. This supports the hypothesis that APACHE scores do not work accurately outside ICU settings.