Improved diagnostic efficiency of CRC subgroups revealed using machine learning based on intestinal microbes

Abstract

Background

Colorectal cancer (CRC) is a common cancer that causes millions of deaths worldwide each year. Numerous studies have confirmed that intestinal microbes play a crucial role in the development of CRC. Studies have also shown that CRC can be divided into several consensus molecular subtypes (CMS) based on tumor gene expression, and CRC microbiomes have been reported to be related to CMS. However, most previous studies of the intestinal microbiome in CRC have only compared patients with healthy controls, without classifying CRC patients based on intestinal microbial composition.

Results

In this study, a CRC cohort comprising 339 CRC samples and 333 healthy controls was selected as the discovery set, and the CRC samples were divided into two subgroups (234 in Subgroup1 and 105 in Subgroup2) using the PAM clustering algorithm based on intestinal microbial composition. Microbial diversity differed significantly between the two CRC subgroups (Shannon index, p-value < 0.05), and 129 shared genera were significantly altered (p-value < 0.05), including several CRC marker genera such as Fusobacterium and Bacteroides. A random forest algorithm was used to construct diagnostic models, which showed significantly higher efficiency when the CRC samples were divided into subgroups. An independent cohort comprising 187 CRC samples (153 in Subgroup1 and 34 in Subgroup2) and 123 healthy controls was then used to validate the models and confirmed the results.

Conclusions

These results indicate that dividing CRC samples into subgroups, which differ in microbial composition, can improve the efficiency of disease diagnosis.

Introduction

Colorectal cancer (CRC) is the third most common cancer in the world, killing millions of patients each year [1, 2]. CRC develops gradually through an adenoma-carcinoma process involving multiple complex physiological mechanisms, starting with the formation of polyps in the intestine and progressing to malignancy [3, 4]. This process can take several years, and early screening for bowel cancer is a highly effective preventive measure [2]. The widely established early screening programs for bowel cancer in developed countries have effectively reduced the occurrence of CRC [5,6,7]. Although multiple mechanisms contribute to the occurrence of CRC, a growing number of studies have confirmed that intestinal microbes play an important role in the process [8,9,10,11,12]. For example, Fusobacterium nucleatum and Bacteroides fragilis are important marker species for the detection of early CRC, showing a positive correlation with CRC in several studies [11, 13, 14].

Studies of the tumor genome have verified that CRC can be further divided into consensus molecular subtypes (CMS). Four consensus subtypes (CMS1-CMS4) were defined by the Colorectal Cancer Subtyping Consortium (CRCSC) [15], which may influence clinical treatment and prognostic management. Moreover, there is a correlation between CMS and gut microbes [16,17,18]. For example, CMS-specific microbial profiles were identified in a cohort of 34 CRC patients: Fusobacterium hwasookii and Porphyromonas gingivalis were highly enriched in CMS1, while Selenomonas and Prevotella species were elevated in CMS2 [18]. These findings suggest that a more detailed classification of CRC patients based on intestinal microbes may aid precision therapy [19]. Clustering is an unsupervised classification method for revealing patterns in data, and K-Means and K-Medoids are commonly used clustering algorithms. The Partitioning Around Medoids (PAM) algorithm, often referred to simply as K-Medoids, is one of the most widely used and is considered a more robust version of K-Means [20, 21]. Whereas K-Means minimizes the sum of squared distances to the cluster centers and can therefore be sensitive to outliers, PAM minimizes the sum of distances to the medoids [22]. However, previous studies of intestinal microbes have often overlooked such subgrouping, focusing mainly on comparisons between CRC samples and healthy controls [23, 24]. In this study, a Chinese CRC cohort (339 CRC samples and 333 healthy controls) was selected, and the CRC samples were divided into two subgroups (234 in Subgroup1 and 105 in Subgroup2) based on intestinal microbial composition using the PAM algorithm. We assessed the differences in microbial composition between the CRC subgroups, as well as the efficiency of diagnostic models based on a random forest algorithm. An independent CRC cohort (187 CRC samples and 123 healthy controls) was then selected to verify these results. Our findings indicated significant differences in gut microbiota between the two CRC subgroups, and the diagnostic models were more efficient for the CRC subgroups than for undivided CRC samples.
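The contrast between the K-Means and PAM objectives mentioned above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the cited references: C_k denotes the k-th cluster, \mu_k its mean, m_k its medoid (an actual data point), and d any dissimilarity measure.

```latex
\[
J_{\text{K-Means}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
J_{\text{PAM}} = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, m_k),
\quad m_k \in \{x_1, \dots, x_n\}.
\]
```

Because the K-Means terms are squared, a single distant sample contributes disproportionately to the objective, whereas in PAM each sample contributes only its unsquared dissimilarity to a representative that is itself a member of the data set.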

Material and methods

Cohort description and study design

A total of 672 samples from one Chinese cohort, each with more than 15,000 reads, were chosen as the discovery set for this study, including 339 CRC samples and 333 healthy control samples. The CRC samples were divided into two subgroups (234 in Subgroup1 and 105 in Subgroup2) using the partitioning around medoids (PAM) clustering algorithm. The bacterial taxonomic composition was first compared between the two CRC subgroups, and disease-related markers were then obtained for each subgroup by comparing it with the healthy controls (Fig. 1). Three CRC classifier models were then constructed using the random forest algorithm to differentiate CRC samples from healthy controls: (1) Subgroup1 vs healthy controls, (2) Subgroup2 vs healthy controls, and (3) CRC vs healthy controls. The discovery set was randomly divided into a training set (70%) used to build the models and a testing set (30%) used to verify their potential. An independent cohort from China, comprising 187 CRC samples (153 in Subgroup1 and 34 in Subgroup2) and 123 healthy control samples, was also chosen to validate the results. For these samples, the V3-V4 region of the 16S rRNA gene was amplified using the 319F/806R primer pair, and Illumina MiSeq was used to generate 2 × 300 bp reads. Reads were downloaded from the SRA database under accession number PRJNA763023 [25].
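The 70/30 division of the discovery set can be illustrated with a small sketch. This is not the authors' code (the study used R); it is a Python illustration that assumes the split is drawn separately within each group, that 70% is rounded down, and that a fixed seed is used. The function name `stratified_split` and the group labels are hypothetical.

```python
import random


def stratified_split(labels, train_pct=70, seed=0):
    """Return (train_idx, test_idx), drawing train_pct% of each group at random."""
    rng = random.Random(seed)  # fixed seed (assumption) so the split is repeatable
    by_group = {}
    for idx, group in enumerate(labels):
        by_group.setdefault(group, []).append(idx)
    train, test = [], []
    for group in sorted(by_group):
        members = by_group[group][:]
        rng.shuffle(members)
        n_train = len(members) * train_pct // 100  # rounded down (assumption)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return sorted(train), sorted(test)


# Group sizes reported for the discovery set.
labels = ["Subgroup1"] * 234 + ["Subgroup2"] * 105 + ["Control"] * 333
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each group keeps the proportion of Subgroup1, Subgroup2 and control samples the same in the training and testing sets, which matters when one subgroup is much smaller than the other.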

Fig. 1

Study design and flow diagram. A total of 672 samples were chosen as the discovery set, including 339 CRC samples and 333 healthy control samples. The CRC samples were divided into Subgroup1 and Subgroup2 using the partitioning around medoids (PAM) clustering algorithm. The bacterial taxonomic composition was first compared between the two subgroups, and disease-related markers were then obtained for each subgroup by comparing it with the healthy controls. Random-forest classifiers (RFC) were then built to discriminate CRC from healthy controls. The discovery set was randomly divided into a training phase (70%) to construct the classifier models based on the random forest algorithm, and a testing phase (30%) to verify the potential of the models. Another cohort was selected as the independent validation set, including 187 CRC samples (153 in Subgroup1 and 34 in Subgroup2) and 123 healthy control samples

Data analysis

The NGS data from 16S rRNA gene sequencing were processed using the QIIME2 platform (v2020.2) [26, 27]. Sequencing reads were filtered using the Cutadapt plugin [28], and paired reads were merged using the Vsearch plugin [29]. The feature table (operational taxonomic units, OTUs) was constructed from the merged reads using the Vsearch plugin with de novo clustering at 97% identity, including dereplication and chimera filtering (using the Greengenes 13_8 97% OTU database as reference). Taxonomic information was obtained using the classify-sklearn algorithm in the feature-classifier plugin [30] with the SILVA database (v138.1) [31]. OTUs accounting for fewer than 0.005% of the total sequences were removed to reduce the effect of spurious sequences. The OTUs were aligned using the MAFFT function in the alignment plugin [32], and the phylogenetic tree was built using the FastTree function in the phylogeny plugin [33]. Alpha and beta diversity were calculated using the diversity plugin in QIIME2.
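The low-abundance filter described above can be sketched as follows. This is an illustrative Python version, not the QIIME2 command used in the study; it assumes the 0.005% threshold is applied to each OTU's total count across all samples relative to the grand total of reads. The function name `filter_rare_otus` and the toy table are hypothetical.

```python
def filter_rare_otus(table, num=5, den=100_000):
    """Keep OTUs whose total count is at least num/den of all reads (5/100000 = 0.005%)."""
    totals = {otu: sum(counts) for otu, counts in table.items()}
    grand_total = sum(totals.values())
    # Integer cross-multiplication avoids floating-point issues at the threshold.
    return {
        otu: counts
        for otu, counts in table.items()
        if totals[otu] * den >= grand_total * num
    }


# Toy table: OTU -> read counts in two samples (100,000 reads in total).
toy_table = {
    "OTU_a": [60_000, 39_991],
    "OTU_b": [3, 2],   # 5 reads, exactly 0.005% of the total: kept
    "OTU_c": [4, 0],   # 4 reads, fewer than 0.005% of the total: removed
}
filtered = filter_rare_otus(toy_table)
print(sorted(filtered))
```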

Statistical analysis

All statistical analyses were performed using R (v4.0.0). CRC samples were clustered from their taxonomic profiles with the K-Medoids algorithm, using the pam() function from the cluster package in R [20, 21]. Genus-level comparisons between groups were conducted using the Mann–Whitney U-test [34]. Permutational multivariate analysis of variance (PERMANOVA) was used to analyze the variance of the taxonomic profiles using the adonis() function in R.
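To make the K-Medoids step concrete, the following is a minimal Python sketch of a swap-based medoid search. It is not the pam() implementation from the R cluster package: the naive initialization (real PAM uses a greedy BUILD step), the function names, and the toy one-dimensional data with absolute difference as the dissimilarity are all assumptions made for illustration. The source does not state which dissimilarity was supplied to pam().

```python
def total_cost(dist, medoids):
    """Sum over all points of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def k_medoids_swap(dist, k):
    """Swap medoids with non-medoids until no swap lowers the total cost."""
    n = len(dist)
    medoids = list(range(k))  # naive start (assumption), not PAM's BUILD step
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best:  # accept only strict improvements
                    medoids, best, improved = trial, cost, True
    # Assign each point to the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
    return medoids, labels, best


# Toy data: two well-separated groups of values.
values = [10, 12, 8, 90, 93, 87]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels, cost = k_medoids_swap(dist, k=2)
print(medoids, labels, cost)
```

Because the function takes a precomputed dissimilarity matrix, any measure suitable for abundance profiles could be substituted for the toy absolute difference.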

Linear discriminant analysis effect size (LEfSe, http://huttenhower.sph.harvard.edu/lefse/) analysis was used to identify differentially abundant taxonomic features between cases and controls, using the Kruskal–Wallis rank sum test (p-value < 0.05) and linear discriminant analysis (LDA > 2) [35]. The Multivariate Association with Linear Models algorithm (MaAsLin2) was used to adjust for the effects of age and gender in the taxonomic analysis [36].

Random forest (RF) models were built with the randomForest package (v4.6–14) in R to differentiate CRC samples from healthy controls based on disease-related markers from the LEfSe analysis [37, 38]. Receiver operating characteristic (ROC) curves were constructed and the area under the curve (AUC) was calculated using the pROC package (v1.17.0.1) to evaluate the diagnostic performance of the models [39]. Model efficiency was compared using the roc.test() function.
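As a reminder of what the AUC measures, it equals the probability that a randomly chosen case receives a higher classifier score than a randomly chosen control, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustrative Python version, not the pROC routine used in the study, and the function name and toy scores are hypothetical.

```python
def auc_from_scores(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Toy predicted probabilities of CRC from a hypothetical classifier.
case_scores = [0.9, 0.8, 0.4]
control_scores = [0.7, 0.3, 0.2]
auc = auc_from_scores(case_scores, control_scores)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of a few hundred samples. Rank-based formulas give the same value more efficiently.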

Results

Two CRC subgroups with significant differences in microbial composition

Given the observed high degree of dissimilarity among mucosal bacterial communities, we investigated the clustering of fecal microbiomes in CRC samples. By applying the partitioning around medoids (PAM) clustering algorithm to the genus abundance profiles in the discovery cohort, we found that the bacterial communities of CRC samples converged into two clusters: 234 samples in Subgroup1 and 105 samples in Subgroup2 (Fig. 1, Fig. 2A). PERMANOVA, ANOSIM and MRPP analyses all indicated a significant difference between Subgroup1 and Subgroup2 (p-value = 0.001 in each case). We then used the Mann–Whitney U-test to identify the genera discriminating these two subgroups. Among the 25 most abundant genera, twelve differed significantly between Subgroup1 and Subgroup2: 8 were significantly elevated in Subgroup1 and 4 were significantly elevated in Subgroup2 (Mann–Whitney U-test p-value < 0.05, Fig. 2B, Table S1).

Fig. 2

Bacterial community clustering using the partitioning around medoids (PAM) algorithm. A Principal coordinates analysis (PCoA) of the bacterial community structures of CRC subgroups. B Relative abundance of the top 25 genera among all samples; * represents Mann–Whitney U-test p-value < 0.05. C Venn analysis of the two CRC subgroups at the genus level

To explore the differences in gut microbial composition between the two subgroups, Venn analysis was performed on the genus profiles. Among the 361 shared genera, 62 were significantly elevated in Subgroup1 and 67 were significantly elevated in Subgroup2 (Mann–Whitney U-test p-value < 0.05, Fig. 2C, Table S1). PERMANOVA analysis was performed to identify the factors responsible for these differences. The results indicated that patient age, nerve invasion, TNM stage, and vascular invasion significantly affected the distribution of intestinal bacteria (p-value < 0.05), while gender had no significant effect (Table S2).

Altered alpha-diversity and overall microbial composition in CRC subgroups versus healthy controls

In the discovery cohort, gut microbial diversity in both Subgroup1 and Subgroup2 differed significantly from that of healthy controls, as measured by the Shannon and Observed_otus indices (Fig. 3A, Figure S1A). The direction of change differed, however: Subgroup1 showed increased alpha diversity compared with healthy controls, while Subgroup2 showed reduced alpha diversity. We also observed a significant difference between Subgroup1 and Subgroup2, with the Shannon index lower in Subgroup2 (Mann–Whitney U-test p-value = 2e-05). However, there was no obvious difference in the Observed_otus index between the two subgroups (Figure S1A).
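The two alpha-diversity indices capture different things, which is why they can disagree as reported above. The following Python sketch illustrates this under stated assumptions: the logarithm base is left as a parameter because the source does not state which base was used, and the function names and toy count vectors are hypothetical.

```python
import math


def shannon_index(counts, base=math.e):
    """Shannon index H = -sum(p_i * log(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p, base)
    return h


def observed_otus(counts):
    """Richness: the number of OTUs with at least one read."""
    return sum(1 for c in counts if c > 0)


# Two toy samples with the same richness but different evenness.
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
print(observed_otus(even), observed_otus(skewed))
print(round(shannon_index(even), 3), round(shannon_index(skewed), 3))
```

Both toy samples contain four observed OTUs, yet the sample dominated by a single OTU has a much lower Shannon index. A difference in Shannon index without a difference in Observed_otus therefore points to a change in evenness rather than in richness.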

Fig. 3

Bacterial diversity and taxonomic analysis. A Shannon index of CRC subgroups and healthy controls. The p-value between two groups was calculated by Mann–Whitney U-test. B Venn analysis of the two CRC subgroups and healthy controls at the genus level. C PCoA analysis based on the unweighted UniFrac matrix among CRC subgroups and healthy controls; the p-value was calculated by PERMANOVA analysis

Moreover, Venn analysis showed that 356 of the 746 genera in total were shared among the three groups, while 88 and 17 genera were unique to Subgroup1 and Subgroup2, respectively (Fig. 3B). To assess the overall diversity in microbial composition, we first calculated beta diversity using weighted UniFrac, and PCoA indicated significant differences in the microbial community among all samples (PERMANOVA analysis, Subgroup1 vs healthy controls: p-value = 0.001; Subgroup2 vs healthy controls: p-value = 0.001; Subgroup1 vs Subgroup2: p-value = 0.001) (Fig. 3C). The microbial communities of the three groups were also significantly different based on unweighted UniFrac distance (PERMANOVA analysis, p-value < 0.05) and Bray–Curtis dissimilarity (PERMANOVA analysis, p-value < 0.05) (Figure S1B-C).
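Of the beta-diversity measures mentioned above, Bray–Curtis dissimilarity is the simplest to state, since it needs no phylogenetic tree. A minimal Python sketch follows. It is an illustration rather than the QIIME2 diversity plugin code, and the function name and toy abundance vectors are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i), from 0 to 1."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Toy genus counts for two samples.
sample_1 = [6, 7, 4]
sample_2 = [10, 0, 6]
print(round(bray_curtis(sample_1, sample_2), 4))
```

A value of 0 means the two samples have identical abundances and a value of 1 means they share no taxa. UniFrac distances additionally weight differences by branch lengths on the phylogenetic tree, which this sketch does not attempt.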

Bacteria differentially abundant in CRC subgroups versus healthy controls

To identify differentially abundant taxa in CRC, LEfSe analysis was performed by comparing the relative abundances between (1) Subgroup1 and healthy controls, (2) Subgroup2 and healthy controls, and (3) Subgroup1 and Subgroup2. Forty-eight genera were found to be altered between Subgroup1 and healthy controls (LDA > 2 and p-value < 0.05, Fig. 4A). To further validate the results, a multivariate analysis (MaAsLin2) was performed to control for potential confounding factors, including age and gender, and 39 of the 48 genera still passed the correction (Fig. 4A, Table S3). In the comparison between Subgroup2 and healthy controls, 55 genera were altered (LDA > 2 and p-value < 0.05), 47 of which remained significantly associated after multiple testing correction and covariate adjustment (MaAsLin2) (Fig. 4B, Table S4). Notably, many altered genera were also observed between Subgroup1 and Subgroup2: 45 genera showed significant differences (LDA > 2 and p-value < 0.05), 37 of which remained significantly associated after correction for the confounders (MaAsLin2) (Figure S2, Table S5).

Fig. 4

Gut microbiota signatures in patients with CRC subgroups. (A-B) LEfSe analysis revealed the relative abundance of genera altered in Subgroup1 (A) and Subgroup2 (B) versus controls. (C-D) Venn diagrams outline the taxa signatures associated with Subgroup1 and Subgroup2, respectively, as well as the taxa consistently altered in both CRC subgroups. # and † denote genera associated with Subgroup1 and Subgroup2, respectively, after adjusting for age and gender using MaAsLin2

Next, we focused on the taxonomic signatures that changed significantly in one subgroup, but not in the other, when compared with healthy controls. In Subgroup1, the relative abundances of 26 genera were specifically different from those in healthy controls, including increased Hungatella, Sutterella, Flavonifractor, Bacteroides, and Lachnospiraceae_UCG-010, as well as reduced Dorea, Fusicatenibacter, Bifidobacterium, and [Eubacterium]_hallii_group (Fig. 4C,D). In Subgroup2, 33 genera were exclusively altered, among which Pseudomonas, Bifidobacterium, Streptococcus, Haemophilus, and Clostridia_UCG-014 were over-represented, while others such as Colidextribacter, Lachnoclostridium, Lachnospiraceae_UCG-010, Akkermansia, and Butyricimonas were under-represented (Fig. 4C,D).

Compared with healthy controls, Subgroup1 and Subgroup2 also overlapped considerably in their altered taxa. In total, 22 genera were associated with both Subgroup1 and Subgroup2 (LDA > 2 and p-value < 0.05), including increased Parvimonas, Peptostreptococcus, Fusobacterium, and Gemella, as well as decreased Romboutsia, Ruminococcus, Megamonas, Faecalibacterium, Lachnospira, Butyricicoccus, and Blautia (Fig. 4C,D). In addition to these shared disease-associated genera, each CRC subgroup contains unique taxa, which may be related to different mechanisms in CRC pathogenesis.
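The Venn-style comparison above amounts to set operations on the lists of altered genera, kept separate by direction of change. The Python sketch below uses only a partial subset of the enriched genera named in the text, so the sets are illustrative rather than the full results. The counts in the second half are the reported totals, checked for internal consistency.

```python
# Partial, illustrative sets of genera reported as enriched versus healthy controls.
enriched_s1 = {"Hungatella", "Sutterella", "Flavonifractor", "Bacteroides",
               "Parvimonas", "Peptostreptococcus", "Fusobacterium", "Gemella"}
enriched_s2 = {"Pseudomonas", "Bifidobacterium", "Streptococcus", "Haemophilus",
               "Parvimonas", "Peptostreptococcus", "Fusobacterium", "Gemella"}

shared_enriched = enriched_s1 & enriched_s2   # enriched in both subgroups
only_s1 = enriched_s1 - enriched_s2           # enriched in Subgroup1 only
only_s2 = enriched_s2 - enriched_s1           # enriched in Subgroup2 only

# Reported counts: subgroup-specific genera plus genera shared by both subgroups.
specific_s1, specific_s2, shared_both = 26, 33, 22
total_s1 = specific_s1 + shared_both          # genera altered in Subgroup1
total_s2 = specific_s2 + shared_both          # genera altered in Subgroup2
print(sorted(shared_enriched), total_s1, total_s2)
```

The totals reproduce the 48 and 55 genera reported by LEfSe for Subgroup1 and Subgroup2, respectively.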

Gut microbiome-based signature discriminated CRC subgroups from healthy controls

To investigate the potential diagnostic value of gut microbial profiles in disease prediction, we constructed random forest classifier models to distinguish disease samples from healthy controls. The key discriminatory bacterial taxa were selected as biomarkers based on LEfSe analysis (LDA > 2 and p-value < 0.05, Tables S3, S4, S6).

In the training phase (70% of the samples randomly selected from the discovery cohort), the model distinguishing CRC from healthy controls, based on 46 signature genera, achieved an area under the curve (AUC) value of 85.64% (95% CI: 82.28% − 89%). In contrast, the models based on 48 signature genera from Subgroup1 vs healthy controls and 55 signature genera from Subgroup2 vs healthy controls performed better, with AUCs of 90.51% (95% CI: 87.61% − 93.41%) and 92.5% (95% CI: 89.32% − 95.69%), respectively (Fig. 5A, Figure S3A, D, G). Notably, the AUCs for Subgroup1 vs healthy controls and Subgroup2 vs healthy controls were significantly higher than the AUC for CRC vs healthy controls (roc.test, p-value < 0.05) (Fig. 5A). These results indicated that clustering the CRC samples could markedly improve diagnostic efficiency.

Fig. 5

Disease classification models based on disease-associated markers. Random-forest classifiers (RFC) were built to discriminate CRC from healthy controls based on disease-associated taxa identified by LEfSe in the discovery cohort, including 339 CRC samples (234 in Subgroup1 and 105 in Subgroup2) and 333 healthy controls. The models were constructed in the training phase (70% of the samples in each group, randomly selected from the discovery cohort), and the diagnostic efficacy was validated in the testing phase (the remaining 30% of samples from the discovery cohort). Moreover, another cohort, including 187 CRC samples (153 in Subgroup1 and 34 in Subgroup2) and 123 healthy control samples, served as an independent external validation set to verify the potential of the classifiers. (A-B) The AUC values of disease classification models using random forest in the training phase (A) and testing phase (B) from the discovery cohort. (C) The AUC values in the independent external validation phase. * indicates a significant difference (p-value < 0.05) between models using the roc.test() function, while NS indicates no significant difference (p-value > 0.05)

In the testing phase, the remaining 30% of the samples from the discovery cohort were used to validate the diagnostic efficacy. Using the bacterial taxa as predictors to separate patients from healthy controls, the models performed best for Subgroup2 vs healthy controls (AUC: 93.81%, 95% CI: 89.98% − 97.63%), followed by Subgroup1 vs healthy controls (AUC: 92.15%, 95% CI: 88% − 96.3%) and CRC vs healthy controls (AUC: 84.8%, 95% CI: 79.52% − 90.08%). The AUCs for the CRC subgroups were also significantly higher than for CRC without division into subgroups (Fig. 5B, Figure S3B,E,H), suggesting a notable improvement in diagnostic efficiency when models are based on Subgroup1 and Subgroup2.

In addition, an independent external validation cohort from China (187 CRC samples, including 153 in Subgroup1 and 34 in Subgroup2, and 123 healthy controls) was used to further verify the diagnostic potential of the classifiers. The models generated AUCs of 87.68% (95% CI: 83.6% − 91.76%), 86.82% (95% CI: 80.16% − 93.49%) and 85.71% (95% CI: 81.42% − 90%) for Subgroup1 vs healthy controls, Subgroup2 vs healthy controls, and CRC vs healthy controls, respectively, demonstrating their ability to effectively discriminate CRC patients from healthy controls (Fig. 5C, Figure S3C,F,I). These results highlighted the robust diagnostic efficiency of the RF models based on microbial markers. A summary of the AUCs with 95% CI is listed in supplemental Table S7.

Discussion

In this study, the PAM clustering algorithm was used to divide CRC samples into two different subgroups based on intestinal microbes (Fig. 1, Fig. 2A). PERMANOVA analysis showed a significant difference between the two CRC subgroups (p-value < 0.05) (Fig. 2A), and ANOSIM and MRPP analyses confirmed the difference (p-value < 0.05). Factors such as patient age, nerve invasion, TNM stage, and vascular invasion had significant effects on these differences (PERMANOVA analysis p-value < 0.05) (Table S2).

Among the 25 genera with the highest abundance, 12 genera were significantly different between the two CRC subgroups (Fig. 2B, Table S1), including Bacteroides, Bifidobacterium, Streptococcus, Fusobacterium and Klebsiella. Among these genera, Fusobacterium deserves special attention, as it has been reported to promote the progression of CRC in several studies [11, 13, 14]. Venn analysis at the genus level showed that a total of 587 genera were detected in the CRC patients. Of these, 361 genera were detected in both subgroups, 129 of which differed significantly between the two subgroups (62 enriched in Subgroup1 and 67 enriched in Subgroup2) (Fig. 2C). In addition to the shared genera, each subgroup also contained unique genera: 185 genera were found exclusively in Subgroup1, while 41 genera were unique to Subgroup2 (Fig. 2C). These results indicated distinct microbial compositions in Subgroup1 and Subgroup2.

Compared with healthy controls, the Shannon index was significantly higher in Subgroup1 and significantly lower in Subgroup2 (p-value < 0.05) (Fig. 3A). In addition, the Shannon index in Subgroup1 was significantly higher than in Subgroup2 (Fig. 3A). Berbert et al. reported a significant increase in alpha diversity in CRC [40], while other studies reported the opposite. Venn analysis showed that 356 genera (47.7% of the total number of genera) were shared by the three groups, while 88 genera were unique to Subgroup1 and 17 genera were unique to Subgroup2 (Fig. 3B). Furthermore, beta-diversity analysis revealed significant differences among the three groups (PERMANOVA, p-value < 0.05) (Fig. 3C, Figure S1A-B). These results indicated that differences existed not only between the two CRC subgroups, but also between each CRC subgroup and healthy controls. This underscores the importance of subdividing CRC based on intestinal microbial characteristics, which may facilitate precision therapy and health management.

LEfSe analysis was then used to find the key microbes responsible for these differences. Compared with healthy controls, 48 genera were significantly altered in Subgroup1 (Fig. 4A, Table S3), while 55 genera were significantly altered in Subgroup2 (Fig. 4B, Table S4). The two CRC subgroups overlapped considerably in their significant genera: 7 genera were significantly enriched in both subgroups, all of which remained significantly different after adjusting for age and gender with MaAsLin2 (Fig. 4C), namely Parvimonas, Alistipes, Peptostreptococcus, Rothia, Granulicatella, Fusobacterium and Gemella. A series of studies has confirmed that Fusobacterium can promote the occurrence and development of CRC [11]. Parvimonas and Peptostreptococcus were also found to be enriched in CRC patients in a 16S rRNA study [41]. Although Fusobacterium was significantly enriched in both CRC subgroups, its abundance differed notably between the two subgroups (Fig. 2B, Figure S3), suggesting that variations in Fusobacterium may lead to different disease states. Additionally, 15 genera were significantly decreased in both CRC subgroups (Fig. 4D, Figure S4), including Faecalibacterium, [Eubacterium]_ventriosum_group, [Eubacterium]_eligens_group, Butyricicoccus, Blautia and Megamonas. Faecalibacterium, Blautia and Megamonas have been reported as key protective genera against CRC, as they are typically reduced in CRC patients [40, 42].

We then focused on taxa that were altered in only one subgroup. Fourteen genera were significantly enriched only in Subgroup1, including Hungatella, Sutterella, Flavonifractor and Bacteroides. Notably, Flavonifractor has been demonstrated to be a potential marker for early CRC in a previous study [43]. In contrast, 16 genera were uniquely enriched in Subgroup2, including Pseudomonas, Bifidobacterium, Streptococcus and Haemophilus (Fig. 4C). Additionally, twelve genera were significantly decreased in Subgroup1, including Anaerostipes, Dorea, Bifidobacterium and Haemophilus, while 17 genera were significantly decreased in Subgroup2 (Fig. 4D). It is noteworthy that Dorea, Erysipelatoclostridium, Bifidobacterium, Haemophilus and Anaerostipes were significantly decreased in Subgroup1 but significantly enriched in Subgroup2. Dorea has previously been found to be more abundant in CRC patients than in healthy controls [44], and has the ability to adhere to cancer cells [45]. Bifidobacterium, known for its probiotic effects, has been reported to be reduced in CRC patients; it is thought to inhibit the growth of CRC by reducing inflammation, suppressing angiogenesis, and enhancing the function of the intestinal barrier through the secretion of short-chain fatty acids (SCFAs) [41, 42, 46]. The elevation of Bifidobacterium in Subgroup2, compared with both Subgroup1 and the control group, suggests a potential link between this genus and the development of CRC in Subgroup2. Therefore, careful regulation of Bifidobacterium intake and cautious selection of probiotics may be necessary when treating Subgroup2 patients. Gram-negative Haemophilus is responsible for the production of cytolethal distending toxin (CDT), which induces DNA damage, double-strand DNA breaks, mutations, and G2/M cell cycle arrest in human colon cancer cells [47]. Meanwhile, Gram-positive Erysipelatoclostridium and its metabolite, ptilosteroid A, have been identified as potential diagnostic biomarkers for radiation-induced intestinal injury [48]. Anaerostipes has been reported to be significantly less abundant in CRC patients than in healthy controls [49, 50]. These findings indicate that distinct patterns of microbe-disease associations may exist in the two CRC subgroups, and this should be considered in clinical treatment.

Finally, we used the random forest algorithm to construct disease diagnostic models and evaluate their potential application in clinical diagnosis. Samples from the discovery cohort were first randomly divided into a training set (70% of the discovery cohort) for model construction and a testing set (the remaining 30%) for internal verification. In the training phase, the AUCs for the two CRC subgroups (90.51% for Subgroup1 vs healthy controls and 92.5% for Subgroup2 vs healthy controls) were significantly higher than the AUC for CRC vs healthy controls (85.64%) (p-value < 0.05) (Fig. 5A, Figure S5 A,D,G, Table S7). In the internal validation phase, the AUCs for the two CRC subgroups were still significantly higher than for CRC vs healthy controls (p-value < 0.05), with AUCs of 92.15% for Subgroup1 vs healthy controls, 93.81% for Subgroup2 vs healthy controls and 84.8% for CRC vs healthy controls (Fig. 5B, Figure S5 B,E,H, Table S7). These results suggested that diagnostic efficiency could be significantly improved by dividing CRC samples into two subgroups, with Subgroup2 showing the highest diagnostic performance. To further validate the results, an independent cohort was selected for verification. The AUCs for the two CRC subgroups (87.68% for Subgroup1 vs healthy controls and 86.82% for Subgroup2 vs healthy controls) were again higher than that for CRC vs healthy controls (85.71%) (Fig. 5C, Figure S5C,F,I, Table S7). These results confirm the importance of dividing CRC into subgroups based on gut microbial composition to improve diagnostic accuracy.

Conclusions

In conclusion, two microbiome-based CRC subgroups were identified using the PAM clustering algorithm, revealing significant differences in microbial composition and abundance. These differences indicated distinct patterns of microbe-disease associations. In addition, the machine learning-based diagnostic models were significantly more efficient when applied to the divided CRC subgroups. Distinguishing different intestinal microbial patterns in CRC patients is therefore essential, as gut microbes not only contribute to CRC pathogenesis but also account for part of CRC heterogeneity, which may have clinical utility for CRC screening, diagnosis and treatment. It is advisable to adopt an individualized treatment plan according to the patient's condition, such as selecting suitable probiotics. In the future, multi-omics approaches integrating tumor genome and transcriptome data should be employed to further elucidate the mechanisms and to explore the interactions between tumor patients and intestinal microbes.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

CRC:

Colorectal cancer

PAM:

Partitioning Around Medoids

CMS:

Consensus molecular subtypes

NGS:

Next generation sequencing

OTUs:

Operational taxonomic units

LEfSe:

Linear discriminant analysis effect size

RF models:

Random forest models

ROC:

Receiver operating characteristic

AUC:

Area under curve

SCFAs:

Short-chain fatty acids

References

  1. Stoffel EM, Murphy CC. Epidemiology and mechanisms of the increasing incidence of colon and rectal cancers in young adults. Gastroenterology. 2020;158(2):341–53. Available from: https://pubmed.ncbi.nlm.nih.gov/31394082/. Cited 2022 Nov 1.

  2. Brenner H, Kloor M, Pox CP. Colorectal cancer. Lancet (London, England). 2014;383(9927):1490–502. Available from: https://pubmed.ncbi.nlm.nih.gov/24225001/. Cited 2022 Nov 1.

  3. Jones S, Chen WD, Parmigiani G, Diehl F, Beerenwinkel N, Antal T, et al. Comparative lesion sequencing provides insights into tumor evolution. Proc Natl Acad Sci U S A. 2008;105(11):4283–8. Available from: https://pubmed.ncbi.nlm.nih.gov/18337506/. Cited 2022 Nov 1.

  4. Fearon ER, Vogelstein B. A genetic model for colorectal tumorigenesis. Cell. 1990;61(5):759–67. Available from: https://pubmed.ncbi.nlm.nih.gov/2188735/. Cited 2022 Nov 1.

  5. Xi Y, Xu P. Global colorectal cancer burden in 2020 and projections to 2040. Transl Oncol. 2021;14(10). Available from: https://pubmed.ncbi.nlm.nih.gov/34243011/. Cited 2022 Nov 1.

  6. Shaukat A, Kahi CJ, Burke CA, Rabeneck L, Sauer BG, Rex DK. ACG Clinical guidelines: colorectal cancer screening 2021. Am J Gastroenterol. 2021;116(3):458–79. Available from: https://pubmed.ncbi.nlm.nih.gov/33657038/. Cited 2022 Nov 1.

  7. Díaz-Tasende J. Colorectal cancer screening and survival. Rev Esp Enferm Dig. 2018;110(11):681–3. Available from: https://pubmed.ncbi.nlm.nih.gov/30284905/. Cited 2022 Nov 1.

  8. Ashktorab H, Kupfer SS, Brim H, Carethers JM. Racial disparity in gastrointestinal cancer risk. Gastroenterology. 2017;153(4):910–23. Available from: https://pubmed.ncbi.nlm.nih.gov/28807841/. Cited 2022 Nov 1.

  9. Zhang F, Zhang Y, Zhao W, Deng K, Wang Z, Yang C, et al. Metabolomics for biomarker discovery in the diagnosis, prognosis, survival and recurrence of colorectal cancer: a systematic review. Oncotarget. 2017;8(21):35460–72. Available from: https://pubmed.ncbi.nlm.nih.gov/28389626/. Cited 2022 Nov 1.

  10. Dalal N, Jalandra R, Bayal N, Yadav AK, Harshulika, Sharma M, et al. Gut microbiota-derived metabolites in CRC progression and causation. J Cancer Res Clin Oncol. 2021;147(11):3141–55. Available from: https://pubmed.ncbi.nlm.nih.gov/34273006/. Cited 2022 Nov 1.

  11. Wong SH, Yu J. Gut microbiota in colorectal cancer: mechanisms of action and clinical applications. Nat Rev Gastroenterol Hepatol. 2019;16(11):690–704. Available from: https://pubmed.ncbi.nlm.nih.gov/31554963/. Cited 2022 Feb 18.

  12. Rebersek M. Gut microbiome and its role in colorectal cancer. BMC Cancer. 2021;21(1). Available from: https://pubmed.ncbi.nlm.nih.gov/34895176/. Cited 2023 Feb 27.

  13. Bullman S, Pedamallu CS, Sicinska E, Clancy TE, Zhang X, Cai D, et al. Analysis of Fusobacterium persistence and antibiotic response in colorectal cancer. Science. 2017;358(6369):1443–8. Available from: https://pubmed.ncbi.nlm.nih.gov/29170280/. Cited 2022 Nov 1.

  14. Castellarin M, Warren RL, Freeman JD, Dreolini L, Krzywinski M, Strauss J, et al. Fusobacterium nucleatum infection is prevalent in human colorectal carcinoma. Genome Res. 2012;22(2):299–306. Available from: https://pubmed.ncbi.nlm.nih.gov/22009989/. Cited 2022 Nov 1.

  15. Guinney J, Dienstmann R, Wang X, De Reyniès A, Schlicker A, Soneson C, et al. The consensus molecular subtypes of colorectal cancer. Nat Med. 2015;21(11):1350–6. Available from: https://pubmed.ncbi.nlm.nih.gov/26457759/. Cited 2023 Feb 28.

  16. Ros J, Baraibar I, Martini G, Salvà F, Saoudi N, Cuadra‑Urteaga JL, et al. The evolving role of consensus molecular subtypes: a step beyond inpatient selection for treatment of colorectal cancer. Curr Treat Options Oncol. 2021;22(12). Available from: https://pubmed.ncbi.nlm.nih.gov/34741675/. Cited 2023 Feb 27.

  17. Rebersek M. Consensus molecular subtypes (CMS) in metastatic colorectal cancer - personalized medicine decision. Radiol Oncol. 2020;54(3):272–7. Available from: https://pubmed.ncbi.nlm.nih.gov/32463385/. Cited 2023 Feb 27.

  18. Purcell RV, Visnovska M, Biggs PJ, Schmeier S, Frizelle FA. Distinct gut microbiome patterns associate with consensus molecular subtypes of colorectal cancer. Sci Rep. 2017;7(1). Available from: https://pubmed.ncbi.nlm.nih.gov/28912574/. Cited 2023 Feb 27.

  19. de Souza JB, Brelaz-de-Castro MCA, Cavalcanti IMF. Strategies for the treatment of colorectal cancer caused by gut microbiota. Life Sci. 2022;290. Available from: https://pubmed.ncbi.nlm.nih.gov/34896161/. Cited 2023 Feb 27.

  20. Schubert E, Rousseeuw PJ. Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2019;11807 LNCS:171–87. Available from: https://doi.org/10.1007/978-3-030-32047-8_16. Cited 2023 Feb 27.

  21. Schubert E, Rousseeuw PJ. Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms. Inf Syst. 2020;101. Available from: http://arxiv.org/abs/2008.05171. Cited 2023 Feb 27.

  22. Shi Y, Zhang L, Peterson CB, Do KA, Jenq RR. Performance determinants of unsupervised clustering methods for microbiome data. Microbiome. 2022;10(1). Available from: https://pubmed.ncbi.nlm.nih.gov/35120564/. Cited 2024 Jul 17.

  23. Yu J, Feng Q, Wong SH, Zhang D, Yi Liang Q, Qin Y, et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut. 2017;66(1):70–8. Available from: https://pubmed.ncbi.nlm.nih.gov/26408641/. Cited 2023 Feb 27.

  24. Coker OO, Liu C, Wu WKK, Wong SH, Jia W, Sung JJY, et al. Altered gut metabolites and microbiota interactions are implicated in colorectal carcinogenesis and can be non-invasive diagnostic biomarkers. Microbiome. 2022;10(1). Available from: https://pubmed.ncbi.nlm.nih.gov/35189961/. Cited 2023 Feb 27.

  25. Yang Y, Du L, Shi D, Kong C, Liu J, Liu G, et al. Dysbiosis of human gut microbiome in young-onset colorectal cancer. Nat Commun. 2021;12(1). Available from: https://pubmed.ncbi.nlm.nih.gov/34799562/. Cited 2023 Feb 27.

  26. Bolyen E, Rideout JR, Dillon MR, Bokulich NA, Abnet CC, Al-Ghalith GA, et al. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nat Biotechnol. 2019;37(8):852–7. Available from: https://pubmed.ncbi.nlm.nih.gov/31341288/. Cited 2022 Feb 16.

  27. Liu G, Li T, Zhu X, Zhang X, Wang J. An independent evaluation in a CRC patient cohort of microbiome 16S rRNA sequence analysis methods: OTU clustering, DADA2, and Deblur. Front Microbiol. 2023;14. Available from: https://pubmed.ncbi.nlm.nih.gov/37560524/. Cited 2024 Jul 17.

  28. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17(1):10–2. Available from: https://journal.embnet.org/index.php/embnetjournal/article/view/200/479. Cited 2023 Feb 27.

  29. Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. 2016;4(10). Available from: https://pubmed.ncbi.nlm.nih.gov/27781170/. Cited 2022 Nov 1.

  30. Bokulich NA, Kaehler BD, Rideout JR, Dillon M, Bolyen E, Knight R, et al. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2’s q2-feature-classifier plugin. Microbiome. 2018;6(1). Available from: https://pubmed.ncbi.nlm.nih.gov/29773078/. Cited 2022 Nov 1.

  31. Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35(21):7188–96. Available from: https://pubmed.ncbi.nlm.nih.gov/17947321/. Cited 2022 Nov 1.

  32. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. Available from: https://pubmed.ncbi.nlm.nih.gov/23329690/. Cited 2023 Feb 27.

  33. Price MN, Dehal PS, Arkin AP. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3). Available from: https://pubmed.ncbi.nlm.nih.gov/20224823/. Cited 2023 Feb 27.

  34. Bauer DF. Constructing confidence sets using rank statistics. J Am Stat Assoc. 1972;67:687–90.

  35. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol. 2011;12(6). Available from: https://pubmed.ncbi.nlm.nih.gov/21702898/. Cited 2022 Feb 17.

  36. Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput Biol. 2021;17(11). Available from: https://pubmed.ncbi.nlm.nih.gov/34784344/. Cited 2023 Feb 27.

  37. Liu Y, Zhao H. Variable importance-weighted Random Forests. Quant Biol (Beijing, China). 2017;5(4):338–51. Available from: https://pubmed.ncbi.nlm.nih.gov/30034909/. Cited 2022 Nov 2.

  38. Yachida S, Mizutani S, Shiroma H, Shiba S, Nakajima T, Sakamoto T, et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat Med. 2019;25(6):968–76. Available from: https://pubmed.ncbi.nlm.nih.gov/31171880/. Cited 2022 Feb 17.

  39. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12. Available from: https://pubmed.ncbi.nlm.nih.gov/21414208/. Cited 2023 Feb 27.

  40. Berbert L, Santos A, Magro DO, Guadagnini D, Assalin HB, Lourenço LH, et al. Metagenomics analysis reveals universal signatures of the intestinal microbiota in colorectal cancer, regardless of regional differences. Braz J Med Biol Res. 2022;55. Available from: https://pubmed.ncbi.nlm.nih.gov/35293551/. Cited 2023 Feb 27.

  41. Chen W, Liu F, Ling Z, Tong X, Xiang C. Human intestinal lumen and mucosa-associated microbiota in patients with colorectal cancer. PLoS One. 2012;7(6). Available from: https://pubmed.ncbi.nlm.nih.gov/22761885/. Cited 2023 Feb 27.

  42. Chattopadhyay I, Dhar R, Pethusamy K, Seethy A, Srivastava T, Sah R, et al. Exploring the role of gut microbiome in colon cancer. Appl Biochem Biotechnol. 2021;193(6):1780–99. Available from: https://pubmed.ncbi.nlm.nih.gov/33492552/. Cited 2023 Feb 27.

  43. Yang Y, Du L, Shi D, Kong C, Liu J, Liu G, et al. Dysbiosis of human gut microbiome in young-onset colorectal cancer. Nat Commun. 2021;12(1). Available from: https://pubmed.ncbi.nlm.nih.gov/34799562/. Cited 2022 Feb 17.

  44. Hibberd AA, Lyra A, Ouwehand AC, Rolny P, Lindegren H, Cedgård L, et al. Intestinal microbiota is altered in patients with colon cancer and modified by probiotic intervention. BMJ Open Gastroenterol. 2017;4(1). Available from: https://pubmed.ncbi.nlm.nih.gov/28944067/. Cited 2023 Feb 27.

  45. Ho CL, Tan HQ, Chua KJ, Kang A, Lim KH, Ling KL, et al. Engineered commensal microbes for diet-mediated colorectal-cancer chemoprevention. Nat Biomed Eng. 2018;2(1):27–37. Available from: https://pubmed.ncbi.nlm.nih.gov/31015663/. Cited 2023 Feb 27.

  46. Lin C, Cai X, Zhang J, Wang W, Sheng Q, Hua H, et al. Role of gut microbiota in the development and treatment of colorectal cancer. Digestion. 2019;100(1):72–8. Available from: https://pubmed.ncbi.nlm.nih.gov/30332668/. Cited 2023 Feb 27.

  47. Guidi R, Guerra L, Levi L, Stenerlöw B, Fox JG, Josenhans C, et al. Chronic exposure to the cytolethal distending toxins of Gram-negative bacteria promotes genomic instability and altered DNA damage response. Cell Microbiol. 2013;15(1):98–113. Available from: https://pubmed.ncbi.nlm.nih.gov/22998585/. Cited 2023 Feb 27.

  48. Cai S, Yang Y, Kong Y, Guo Q, Xu Y, Xing P, et al. Gut bacteria Erysipelatoclostridium and its related metabolite ptilosteroid A could predict radiation-induced intestinal injury. Front Public Health. 2022;10. Available from: https://pubmed.ncbi.nlm.nih.gov/35419331/. Cited 2023 Feb 27.

  49. Peters BA, Dominianni C, Shapiro JA, Church TR, Wu J, Miller G, et al. The gut microbiota in conventional and serrated precursors of colorectal cancer. Microbiome. 2016;4(1):69. Available from: https://pubmed.ncbi.nlm.nih.gov/28038683/. Cited 2023 Feb 27.

  50. Mori G, Rampelli S, Orena BS, Rengucci C, De Maio G, Barbieri G, et al. Shifts of faecal microbiota during sporadic colorectal carcinogenesis. Sci Rep. 2018;8(1). Available from: https://pubmed.ncbi.nlm.nih.gov/29985435/. Cited 2023 Feb 27.

Acknowledgements

Not applicable.

Funding

This work was supported by grants from the Natural Science Basic Research Program of Shaanxi (grant number 2020JC-01).

Author information

Contributions

Jiayin Wang, Yanlei Ma and Guang Liu designed the research study. Guang Liu and Lili Su performed bioinformatics analysis. Guang Liu, Cheng Kong, and Liang Huang wrote the manuscript. Jiayin Wang, Yanlei Ma, Xuanping Zhang and Xiaoyan Zhu edited the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Yanlei Ma or Jiayin Wang.

Ethics declarations

Ethics approval and consent to participate

The data used in this study were downloaded from the SRA database under accession number PRJNA763023 and originate from a published study [25]. In that study, ethical approval was obtained from the Institutional Review Board of Fudan University Shanghai Cancer Center, and written informed consent was provided by all subjects before sampling.

Consent for publication

Not applicable.

Competing interests

Guang Liu and Lili Su are employed by Guangdong Hongyuan Pukang Medical Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

12876_2024_3408_MOESM1_ESM.pdf

Supplementary Material 1: FigureS1. Bacterial diversity and taxonomic analysis. (A) Observed_otus index of the CRC subgroups and healthy controls. The p-value between two groups was calculated by the Mann-Whitney test. (B-C) PCoA analysis based on the weighted UniFrac matrix (B) and Bray-Curtis dissimilarity (C) among the CRC subgroups and healthy controls; p-values were calculated by PERMANOVA analysis.

12876_2024_3408_MOESM2_ESM.pdf

Supplementary Material 2: FigureS2. Gut microbiota signatures in patients of the CRC subgroups. LEfSe analysis revealed the altered genera between Subgroup1 and Subgroup2. # indicates that the altered taxa were adjusted for age and gender using MaAsLin2.

12876_2024_3408_MOESM3_ESM.pdf

Supplementary Material 3: FigureS3. The enriched genera in Subgroup1 and Subgroup2 compared with healthy controls. (A-G) Boxplots of altered genera between CRC subgroups and healthy controls that were enriched in Subgroup1 and Subgroup2. Relative abundances were log-transformed, and zero values were set to 1e-05. *p-value<0.05, **p-value<0.001, ***p-value<0.0001, ****p-value<0.00001 using the Mann–Whitney U-test.

12876_2024_3408_MOESM4_ESM.pdf

Supplementary Material 4: FigureS4. The decreased genera in Subgroup1 and Subgroup2 compared with healthy controls. (A-O) Boxplots of altered genera between CRC subgroups and healthy controls that were decreased in Subgroup1 and Subgroup2. Relative abundances were log-transformed, and zero values were set to 1e-05. *p-value<0.05, **p-value<0.001, ***p-value<0.0001, ****p-value<0.00001 using the Mann–Whitney U-test.

12876_2024_3408_MOESM5_ESM.pdf

Supplementary Material 5: FigureS5. Performance of the disease classification models was evaluated using AUCs. (A, B, C) AUC values between Subgroup1 and controls in the training phase (A), testing phase (B), and independent external validation phase (C). (D, E, F) AUC values between Subgroup2 and controls in the training phase (D), testing phase (E), and independent external validation phase (F). (G, H, I) AUC values between CRC and controls in the training phase (G), testing phase (H), and independent external validation phase (I).

Supplementary Material 6: TableS1. Mann-Whitney test of genera between Subgroup1 and Subgroup2.

Supplementary Material 7: TableS2. Associated factors with intestinal bacteria based on PERMANOVA analysis.

Supplementary Material 8: TableS3. LEfSe analysis between Subgroup1 and healthy controls.

Supplementary Material 9: TableS4. LEfSe analysis between Subgroup2 and healthy controls.

Supplementary Material 10: TableS5. LEfSe analysis between Subgroup1 and Subgroup2.

Supplementary Material 11: TableS6. LEfSe analysis between CRC and healthy controls.

Supplementary Material 12: TableS7. A list of AUC values with 95% CI.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Liu, G., Su, L., Kong, C. et al. Improved diagnostic efficiency of CRC subgroups revealed using machine learning based on intestinal microbes. BMC Gastroenterol 24, 315 (2024). https://doi.org/10.1186/s12876-024-03408-3

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12876-024-03408-3

Keywords