Clinical Data based XGBoost Algorithm for infection risk prediction of patients with decompensated cirrhosis: a 10-year (2012–2021) Multicenter Retrospective Case-control study

Objectives To identify effective predictors of infection in patients with decompensated cirrhosis (DC) using the XGBoost algorithm in a retrospective case-control study. Methods Clinical data were retrospectively collected from 6,648 patients with DC admitted to five tertiary hospitals. Indicators with significant differences were determined by univariate analysis and least absolute shrinkage and selection operator (LASSO) regression. A multi-tree extreme gradient boosting (XGBoost) machine learning model was then used to rank the importance of the features selected by LASSO, and an infection risk prediction model was subsequently constructed with a simple-tree XGBoost model. Finally, the simple-tree XGBoost model was compared with the traditional logistic regression (LR) model. Model performance was evaluated by area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity. Results Six features, namely total bilirubin, blood sodium, albumin, prothrombin activity, white blood cell count, and neutrophils to lymphocytes ratio, were selected as predictors of infection in patients with DC. The simple-tree XGBoost model built on these features predicted infection risk accurately, with an AUROC of 0.971, sensitivity of 0.915, and specificity of 0.900 in the training set. The simple-tree XGBoost model performed better than the traditional LR model in the training set, internal validation set, and external validation set (P < 0.001). Conclusions The simple-tree XGBoost predictive model, developed from a minimal amount of clinical data available for DC patients in settings with restricted medical resources, could help primary healthcare practitioners promptly identify potential infection. Supplementary Information The online version contains supplementary material available at 10.1186/s12876-023-02949-3.


Introduction
The natural history of cirrhosis is characterized by an asymptomatic compensated phase followed by a decompensated phase, marked by the development of overt clinical signs, the most frequent of which are ascites, bleeding, encephalopathy, and jaundice [1][2][3]. Patients with decompensated cirrhosis (DC) are critically ill and have high mortality. One study showed that the annual mortality rate of patients with DC reaches 20%, much higher than the 7% of patients with compensated cirrhosis [4]. Patients with DC also have more complications, of which infection is the most common [5]. Many kinds of infection occur in cirrhosis, such as spontaneous bacterial peritonitis (SBP) [6,7], urinary system infection [8], and spontaneous bacteremia [9,10]. Infection is also an important precipitating factor of severe complications such as upper gastrointestinal bleeding, hepatic encephalopathy, and hepatorenal syndrome, and is one of the main causes of death in patients with advanced liver cirrhosis [11][12][13]. Over the past few decades, various cohort studies have evaluated SBP-related in-hospital mortality. From December 1984 to February 1989, the Liver Unit at the University of Barcelona Hospital Clinic reported a 38% in-hospital mortality in 185 consecutive cirrhotic patients with SBP [14]. In another 10-year cohort study (from 1988 to 1998), Maryland hospitals reported that 112 of 343 patients with SBP died in the hospital, a mortality rate of 32.6% [15]. Thus, patients with DC complicated by infection usually have a poor prognosis. Identifying the risk factors for infection in DC and constructing a prediction model are therefore of great significance for improving prognosis and reducing the risk of mortality in DC patients.
As a branch of artificial intelligence, machine learning has been applied in disease prediction and diagnosis [16][17][18]. Classical machine learning approaches include decision tree models and ensemble tree models, as well as support vector machines (SVM) [19] and neural network models (NNs) [20], which are commonly used; XGBoost is the most commonly used ensemble tree algorithm [21]. Among the many machine learning algorithms and models, logistic regression (LR) is better suited to linear relationships, whereas XGBoost, multilayer perceptron (MLP), random forest (RF), naive Bayes (NB), and SVM can handle nonlinear relationships well [22][23][24]. In addition, XGBoost has become one of the most successful algorithms in machine learning competitions and has been widely used with good results.
Kim et al. developed 55 machine learning models (RF, NNs, XGBoost, generalized linear model, etc.) to predict the need for intensive care in patients with COVID-19 and found that the XGBoost model showed the highest discrimination performance. The area under the receiver operating characteristic curve (AUROC) of the XGBoost model was 0.897 in the development group and 0.885 in the validation group, indicating that the model can effectively predict the need for intensive care in patients with COVID-19 [25]. Huang et al. used the traditional Cox proportional hazards model and three machine learning models to construct and select the best model for predicting recurrence after resection of hepatocellular carcinoma, for early monitoring and identification of patients at high risk of recurrence. In the internal validation set, the XGBoost model obtained the best discrimination, with a C index of 0.713, which affirmed the value of the XGBoost model in prediction [26].
Although the importance of XGBoost in clinical decision-making has been gradually recognized by clinicians, its value in predicting infection in patients with DC has not been reported. Therefore, we designed this study to develop an XGBoost model combining demographic characteristics, etiology, complications, and laboratory indicators to predict the probability of infection in patients with DC, and further compared the XGBoost model with a prediction method based on conventional LR.

Study design and patients
Clinical data for this study were obtained from five tertiary hospitals in southwest China. In this multicenter retrospective study, 6,648 of 10,689 DC patients with clinical consultation records met the quality standards for the final analysis. Patients from hospitals A-D were randomly divided at a ratio of 7:3 into a training set of 4,353 samples and an internal validation set of 1,866 samples. A total of 429 samples from hospital E were used for external validation. The study adhered to the principles of the Declaration of Helsinki and the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis guidelines [27]. Clinical research ethics approval was obtained from the Ethics Committee of the Affiliated Banan Hospital of Chongqing Medical University (approval number: 2021-008). Individual patient-level consent was not required because the study used only fully de-identified data.
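The 7:3 random division described above can be illustrated with a minimal sketch. The function name, the fixed seed, and the integer placeholder IDs are assumptions made for illustration; the text does not state how the randomization was implemented.

```python
import random

def split_train_validation(sample_ids, train_fraction=0.7, seed=0):
    """Randomly split sample IDs into a training and an internal validation set."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle (seed is arbitrary)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

# 4,353 + 1,866 = 6,219 samples from hospitals A-D, represented by placeholder IDs.
train_ids, valid_ids = split_train_validation(range(6219))
print(len(train_ids), len(valid_ids))  # 4353 1866
```

The 429 samples from hospital E are kept out of this split entirely, which is what makes them usable as an external validation set.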

Diagnostic criteria
The diagnosis of DC was confirmed by liver biopsy, clinical, biochemical, and imaging data, or past medical records, in accordance with the "EASL Clinical Practice Guidelines for the management of patients with decompensated cirrhosis" [1]. Infection was defined by (i) the presence of SBP, pneumonia, cellulitis, urinary system infection, or spontaneous bacteremia, and (ii) a combination of microbial detection and clinical or laboratory signs of infection [28,29].

Inclusion and exclusion criteria
Patients with DC admitted between July 2012 and December 2021 were eligible for inclusion. Exclusion criteria were as follows: (i) age < 18 years, (ii) patients with cancer other than primary liver cancer, (iii) mental illness, (iv) pregnant and lactating women, and (v) variables with > 30% missing values. The detailed selection process is shown in Supplementary Fig. 1.

Data collection
On the basis of previous studies, 28 variables routinely tested or recorded were collected, which included age, sex, hypertension, diabetes, smoking, drinking, primary liver cancer, family history of liver disease, hepatitis B virus (HBV), hepatitis C virus (HCV), alcoholic, autoimmunity, gastrointestinal bleeding (GIB), ascites, hepatic encephalopathy (HE), hepatic failure (HF), total protein (TP), total bilirubin (TB), hemoglobin, blood sodium (Na), blood potassium (K), albumin (ALB), prothrombin activity (PTA), blood urea nitrogen (BUN), creatinine (Cr), red blood cell (RBC) count, white blood cell (WBC) count, and neutrophils to lymphocytes ratio (NLR). Considering that many features may have different values when measured at different time points, we only included the first measurement values of patients after their first admission in this study.

Statistical analysis
Statistical analysis was performed using SPSS 22.0 and R software (version 4.0.2, Vienna, Austria). The Kolmogorov-Smirnov test was applied to assess the normality of quantitative data, and P values > 0.05 were taken to indicate a normal distribution. Data with a normal distribution were presented as the mean ± standard deviation and compared with the t-test, whereas those with a non-normal distribution were described with the median (interquartile range, [IQR]) and compared with the Mann-Whitney U test. Qualitative data were presented as n (%) and compared with the χ² test. We used the R package for multivariate imputation by chained equations (mice) to impute missing data in this study.
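The rule for describing and testing quantitative variables can be written out as a small sketch. The analysis itself was run in SPSS and R; the Python function below, its name, and the example values are hypothetical, and the normality P value is assumed to come from a Kolmogorov-Smirnov test computed elsewhere.

```python
import statistics

def summarize_and_choose_test(values, normality_p, alpha=0.05):
    """Return (summary string, two-group test) for one quantitative variable.

    normality_p: P value of a normality test computed elsewhere;
    P > alpha is treated as a normal distribution, as in the text.
    """
    if normality_p > alpha:
        mean = statistics.mean(values)
        sd = statistics.stdev(values)                    # sample standard deviation
        return f"{mean:.1f} ± {sd:.1f}", "t-test"
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return f"{median:.1f} ({q1:.1f}, {q3:.1f})", "Mann-Whitney U test"

# Hypothetical albumin values (g/L), for illustration only.
albumin = [30.0, 32.0, 35.0, 38.0, 40.0]
print(summarize_and_choose_test(albumin, normality_p=0.20))
print(summarize_and_choose_test(albumin, normality_p=0.01))
```

Note that `statistics.quantiles` uses one of several quartile conventions, so the IQR bounds may differ slightly from those reported by SPSS or R.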
In the model construction phase, we developed the LR and XGBoost models. First, variables with statistically significant differences were identified through univariate analysis. Then, least absolute shrinkage and selection operator (LASSO) regression was used to further screen potentially related variables. Finally, LR and XGBoost models were constructed to analyze the impact of each variable on the risk of infection in patients with DC. The hyperparameters of XGBoost were set as follows: eta = 0.3, max_depth = 5, subsample = 0.5, colsample_bytree = 1, gamma = 0.5. We defined this model as "multi-tree XGBoost", and the ranks of feature importance were then obtained [30]. The correlation between the features of the multi-tree XGBoost model was evaluated using Pearson correlation analysis. To further determine the most significant features related to infection risk in the unbalanced data, we conducted 100 rounds of 5-fold cross-validation in the training set. When the seventh feature was added to the XGBoost model, the increase in AUROC was less than 0.5% (P = 0.158, Supplementary Fig. 2). Finally, six features were selected as significant predictors, and the resulting model was defined as "simple-tree XGBoost".
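The feature-addition stopping rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the "0.5%" threshold is read as an absolute AUROC gain of 0.005, the function and variable names are hypothetical, and the AUROC values are made up purely to exercise the logic. The hyperparameters are the ones quoted above, written as the kind of parameter dictionary the xgboost package accepts; the text does not say which interface was used.

```python
# Hyperparameters quoted in the text for the "multi-tree XGBoost" model.
MULTI_TREE_PARAMS = {
    "eta": 0.3,
    "max_depth": 5,
    "subsample": 0.5,
    "colsample_bytree": 1,
    "gamma": 0.5,
}

def select_features(ranked_features, auroc_by_count, min_gain=0.005):
    """Add importance-ranked features until the next one improves AUROC by < min_gain.

    auroc_by_count[k] is the cross-validated AUROC of a model built on the
    top k + 1 ranked features (computed elsewhere).
    """
    selected = [ranked_features[0]]
    for k in range(1, len(ranked_features)):
        gain = auroc_by_count[k] - auroc_by_count[k - 1]
        if gain < min_gain:
            break                      # the k-th extra feature no longer helps enough
        selected.append(ranked_features[k])
    return selected

# Illustrative values only; these are NOT results from the study.
ranked = ["f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"]
aurocs = [0.70, 0.78, 0.84, 0.89, 0.93, 0.96, 0.962, 0.963]
print(select_features(ranked, aurocs))  # ['f1', 'f2', 'f3', 'f4', 'f5', 'f6']
```

With these made-up numbers the seventh feature adds only 0.002 to the AUROC, so selection stops at six features, mirroring the behavior described in the text.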

Patient characteristics
The Mann-Whitney U test revealed no significant differences in any of the variables with missing values in the training and internal validation sets before and after multiple imputation (Supplementary Table 1). Likewise, there were no significant differences in any of the variables with missing values in the external validation set before and after multiple imputation (Supplementary Table 2). Table 1 summarizes the clinical characteristics of patients in the training and internal validation sets. No significant differences were observed in any of the variables between the two sets (P > 0.05). Patients in the training set were divided into infection and non-infection groups. Univariate analysis revealed that the following variables were significantly associated with infection: sex, hypertension, diabetes, smoking, drinking, primary liver cancer, alcoholic, autoimmunity, GIB, HE, HF, TP, TB, hemoglobin, Na, K, ALB, PTA, BUN, Cr, RBC count, WBC count, and NLR (Table 2).

Clinical features selection in LASSO regression analysis
Next, the 22 features with statistically significant differences in univariate analysis were entered into the LASSO regression analysis, and 11 were significantly associated with infection: GIB, HF, TP, TB, hemoglobin, Na, ALB, PTA, BUN, WBC count, and NLR (Fig. 1).
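For reference, the selection step above can be read against the standard L1-penalized logistic regression objective. This is the textbook formulation rather than one stated in the text; the symbols $n$ (patients), $p = 22$ (candidate features), $x_i$, $y_i \in \{0, 1\}$ (infection status), and $\lambda$ (the penalty weight tuned by cross-validation in Fig. 1B) are introduced here for illustration.

```latex
\[
(\hat{\beta}_0, \hat{\beta}) =
\arg\min_{\beta_0,\,\beta}
\left\{
  -\frac{1}{n} \sum_{i=1}^{n}
  \left[
    y_i \left( \beta_0 + x_i^{\top} \beta \right)
    - \log\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right)
  \right]
  + \lambda \sum_{j=1}^{p} \left| \beta_j \right|
\right\}
\]
```

Larger values of $\lambda$ shrink more coefficients exactly to zero, and the features whose coefficients remain non-zero at the chosen $\lambda$ are the ones retained.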
Figure 2 shows the correlation between these 11 features. There is a significant positive correlation between HF and TB (r = 0.53, P < 0.001), a significant positive correlation between TP and ALB (r = 0.53, P < 0.001), a significant negative correlation between HF and PTA (r = -0.55, P < 0.001), and a significant negative correlation between TB and PTA (r = -0.47, P < 0.001).
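The coefficients reported above are Pearson correlations. A minimal sketch of the computation, using only the standard library, is given below; the function name and the paired values are hypothetical and are not taken from the study data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical paired values of total protein and albumin (g/L), for illustration only.
total_protein = [58.0, 62.0, 65.0, 70.0, 74.0]
albumin = [29.0, 33.0, 31.0, 37.0, 36.0]
print(round(pearson_r(total_protein, albumin), 2))
```

When one of the two variables is binary (for example HF coded 0/1), the same formula yields the point-biserial correlation.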

Construction and evaluation of XGBoost model
The aforementioned 11 features were entered into the multi-tree XGBoost model. Figure 3 shows the rank of their importance. Subsequently, we added the ranked features one by one to the XGBoost model until the improvement in AUROC fell below 0.5%. Six features, namely TB, Na, ALB, PTA, WBC count, and NLR, were selected as the significant factors. A simple-tree XGBoost model was then constructed based on these six key features.
For benchmarking, we also compared the performance of the XGBoost model with that of the conventional multivariable LR model. In the training set, the simple-tree XGBoost model with 6 selected features showed superior performance compared with the LR model with all 11 features (AUROC: 0.971 vs. 0.869, P < 0.001) or with 6 features (AUROC: 0.971 vs. 0.864, P < 0.001) (Fig. 4). Table 3 shows the detailed performance metrics for the four models in the training set. The formulas for the performance criteria are provided in Supplementary Table 3. Similarly, in the internal validation set, the simple-tree XGBoost model exhibited better performance than the LR model with all 11 features (AUROC: 0.998 vs. 0.878, P < 0.001) or with the six selected features (AUROC: 0.998 vs. 0.875, P < 0.001) (Supplementary Fig. 3). Supplementary Table 4 shows the detailed performance metrics for the four models in the internal validation set. In the external validation set, the simple-tree XGBoost model also performed better than the LR models (P < 0.001), and Supplementary Table 5 shows the detailed performance metrics for the four models in the external validation set. In brief, these results suggest that the simple-tree XGBoost model had more precise and stable prediction performance than multivariable LR in identifying infection in patients with DC. In addition, we applied the model to patients from each center separately and compared the diagnostic agreement. The results showed no significant difference between the AUROC of each center and the AUROC of all centers (Supplementary Table 6).
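The performance criteria used in this comparison can be sketched with their standard definitions. The exact formulas are in Supplementary Table 3, which is not reproduced here, so the definitions below are assumed to match; the function names, the 0.5 threshold, and the toy labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random infected case scores higher
    than a random non-infected case (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity, PPV and NPV at a given probability threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Toy data: 1 = infection, 0 = no infection; scores are predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(auroc(labels, scores))
print(confusion_metrics(labels, scores))
```

AUROC is threshold-free, whereas sensitivity, specificity, PPV, and NPV all depend on the chosen cutoff, which is why the two kinds of metric are reported side by side.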

Discussion
This retrospective study of DC patients hospitalized in five tertiary hospitals in southwest China showed that six characteristics, namely TB, Na, ALB, PTA, WBC count, and NLR, were important predictors of the risk of infection in patients with DC. The simple-tree XGBoost model based on these six features showed good prediction performance. In the training set, it had an AUROC of 0.971, sensitivity of 91.5%, specificity of 90.0%, positive predictive value (PPV) of 90.8%, and negative predictive value (NPV) of 90.7%. A growing number of studies have confirmed that laboratory indicators are a convenient and effective basis for prediction models. Wang et al. established a prognostic model from conventional laboratory indicators in COVID-19 patients. The model, based on the combination of neutrophils, lymphocytes, platelets, and IL-2R, showed good performance in predicting death in COVID-19 patients. At a cutoff value of 0.572, the sensitivity and specificity of the prediction model were 90.74% and 94.44%, respectively [31]. In a retrospective cohort study, researchers used laboratory indicators such as hemoglobin, platelet count, white blood cell count, urea nitrogen, creatinine, glucose, sodium, potassium, and total bicarbonate to construct a multivariate LR model predicting in-hospital mortality of hospitalized patients. Good model calibration and fit were observed (Hosmer-Lemeshow = 13.9, P = 0.18) [32]. The simple-tree XGBoost model constructed in this study can likewise provide a simple screening tool for medical providers in the primary health care setting to quickly identify patients at high risk of infection in a single visit.
In a study aimed at constructing a multivariate predictive model for SBP in patients with liver cirrhosis, researchers found that blood neutrophil percentage was a significant predictor of SBP [33]. However, among the five indicators ultimately included in that prediction model, blood neutrophil percentage had the lowest importance. Interestingly, in the present study, NLR was the most important predictor of infection in DC patients, suggesting that NLR may be more sensitive than blood neutrophil percentage in predicting infection. In addition, all six features included in the simple-tree XGBoost model have appeared in other studies constructing prediction models for infection in patients with liver cirrhosis, indicating that the six features selected in this study have high clinical practicality in predicting infection [34][35][36][37].
PTA is a classic index used to judge the severity of liver disease [38]. Its sensitivity and specificity differ across liver diseases in clinical evaluations, but a decrease in its level generally indicates that the patient's liver function is impaired to some degree. Llucia Tito et al. found that PTA was an independent predictor of liver cirrhosis complicated with SBP infection. In this study, a decreased PTA was found to be a risk factor for DC complicated with infection, and the risk of developing an infection would increase 0.04-fold when PTA decreased by 1% [39]. Hypoalbuminemia is also an independent risk factor for infection in DC patients. A low level of ALB reflects poor liver function and nutritional status, reduced detoxification capacity, and a significantly reduced ability to resist pathogenic bacteria, all of which make the patient prone to infection [40]. TB and Na have also been shown to be predictors of infection [41,42].
WBC count was another key predictor in the simple-tree XGBoost model. WBCs are an important component of the body's defense system, and WBC count is a traditional indicator for detecting viral and bacterial infectious diseases [43]. Autoimmune disease, infection, or septicemia can cause excessive consumption of granulocytes, resulting in granulocytopenia. During the diagnosis of infected patients, measuring the WBC count allows a specific analysis of the patient's inflammation; however, in some patients with non-bacterial infection, WBC count may also fluctuate under the influence of the external environment [44,45]. Cheng et al. found that WBC count was an important risk factor for nosocomial bacterial infection in COVID-19 patients in tertiary hospitals. Notably, compared with patients whose WBC count was within (4.0-10.0) × 10⁹/L, patients with a WBC count > 10.0 × 10⁹/L or ≤ 4.0 × 10⁹/L had a 7.38-fold increased risk of nosocomial bacterial infection [46]. The study by Huang also demonstrated that WBC count (threshold > 10 × 10⁹/L) and procalcitonin to lactic acid ratio (threshold > 0.438) may help identify early stages of infection in patients with diabetic ketoacidosis, and combining these two markers may help with specificity [47].
NLR is a particularly interesting parameter. Liver cirrhosis is believed to involve immune insufficiency. Neutrophils reflect the body's immediate response to inflammation and protect it against bacterial infection [48][49][50], whereas the lymphocyte level reflects the immune and nutritional status of the body. In patients with liver cirrhosis, the intestinal barrier is disrupted, the intestinal flora changes, and pathogen-associated molecular patterns produced by bacteria, such as endotoxin, enter the blood circulation [51,52]. Neutrophils can produce a large number of proinflammatory or anti-inflammatory cytokines, such as IL-6, IL-8, and IL-17, when exposed to pathogen-associated molecular patterns and to damage-associated molecular patterns produced by liver cell necrosis. These cytokines in turn promote the activation of neutrophils [51]. As the disease develops, patients often have lymphocytopenia, which may be related to increased lymphocyte apoptosis during inflammation [53]. NLR is therefore an indicator that can reflect the overall immune status of the body. A large number of studies have also confirmed that NLR can be used to evaluate the long-term or short-term prognosis of patients with stable or decompensated cirrhosis and of patients with cirrhosis with or without acute liver failure [48,[54][55][56].
In 2020, the annual per capita disposable income of rural households in China was approximately 17,132 yuan, which is approximately one-third of the income of urban households [57]. Financial cost may be the leading barrier to screening DC patients for the risk of infection. Because of immune response dysfunction, infection poses a huge risk to patients with DC and indicates the beginning of the terminal phase of this disease, but the known risk factors have not fully clarified this relationship. Thus, it is important to minimize the number of variables in diagnostic tools as much as possible in medically underserved settings. The population with limited access to infection care may benefit from our simple-tree XGBoost model, which was developed based on restricted medical resources and would not incur additional expenditures.
The strength of this study is its use of multicenter electronic medical record data to develop an infection prediction model. However, this study still has some limitations. First, because of the retrospective design, the causal relationship between risk factors and infection should be interpreted with caution. Second, some potentially important influencing factors were not included in this study because of substantial missing data. Third, this study can only be regarded as a pilot study. Studies with more features and larger samples should be conducted in the future to verify and improve the overall performance of the model.

Conclusion
Our study suggests that a simple predictive model could provide added value as an automated tool for screening DC patients for infection. We identified six candidate features, namely TB, Na, ALB, PTA, WBC count, and NLR measured at hospital admission, as critical infection risk biomarkers for DC patients. The simple-tree XGBoost model built on these six features can help predict infection in DC patients with > 95% precision and > 95% sensitivity.

Fig. 1 Feature selection by LASSO. (A) LASSO coefficient profiles (y-axis) of the 22 features. The upper x-axis is the average number of predictors and the lower x-axis is log(λ). (B) 10-fold cross-validation for tuning parameter selection in the LASSO model