Predicting metabolite response to dietary intervention using deep learning

Overview of McMLP

We hypothesized that in order to accurately predict post-dietary intervention metabolomic profiles, we first need to capture how the microbiome composition changes from the baseline to the endpoint. This is because metabolomic profiles reflect the microbial metabolism of a community7,42. To test our hypothesis, we proposed McMLP, which consists of two steps: (step-1) use the baseline microbiota and metabolome data (i.e., concentrations of targeted metabolites) and the dietary intervention strategy to predict the endpoint microbial composition; and (step-2) use the predicted endpoint microbial composition, the baseline metabolome data, and the dietary intervention strategy to predict the endpoint metabolomic profile (Fig. 2a; Supplementary Fig. 1a). For each step, we used a multilayer perceptron (MLP) with the Rectified Linear Unit (ReLU) as the activation function to perform the prediction. We emphasize that, in principle, one can use a single MLP to directly predict endpoint metabolomic profiles based on baseline microbiota/metabolome data and the dietary intervention strategy (Supplementary Fig. 1b). As we show later, this one-step strategy has worse predictive power than our two-step strategy (Supplementary Fig. 5).

Fig. 2: The workflow of McMLP (Metabolite response predictor using coupled multilayer perceptrons).

We aim to predict endpoint metabolomic profiles (i.e., metabolomic profiles after the dietary interventions) based on the baseline microbial compositions (i.e., microbial compositions before the dietary intervention), dietary intervention strategy, and baseline metabolomic profiles. Here we used a hypothetical example with n = 5 training samples and 2 samples in the test set. For each sample, we considered \({N}_{{{{\rm{s}}}}}\) microbial species, \({N}_{{{{\rm{d}}}}}\) dietary resources, and \({N}_{{{{\rm{m}}}}}\) metabolites. Across three panels, microbial species and their relative abundances are colored blue, dietary resources and their intervention doses are colored green, and metabolites and their concentrations are colored red. Icons associated with baseline/endpoint data are bounded by solid black/dashed lines respectively. a The model architecture of McMLP. McMLP comprises two coupled MLPs (multilayer perceptrons). The first MLP at the top (step 1) predicts the endpoint microbial compositions based on the baseline data and the dietary intervention strategy. The predicted endpoint microbial compositions from the first MLP are then provided as input to the second MLP at the bottom (step 2). The second MLP combines the predicted endpoint microbial compositions, the dietary intervention strategy, and the baseline metabolomic profiles to finally predict the endpoint metabolomic profiles. The value of dietary intervention strategy is either binary to denote the presence/absence of each dietary resource or numeric to be proportional to the intervention dose. Details of both MLPs can be found in Supplementary Fig. 1 and “Methods”. b McMLP takes two types of baseline data (baseline microbial compositions and baseline metabolomic profiles) and the dietary intervention strategy as input variables and is trained to predict corresponding endpoint metabolomic profiles. During training, the endpoint microbial composition is needed to train the first MLP. 
By contrast, the second MLP directly takes the predicted endpoint microbial composition instead of the actual endpoint microbial composition. c The well-trained McMLP can generate predictions for metabolomic profiles for the test set. During testing, no endpoint microbial composition is needed because the second MLP directly takes the predicted endpoint microbial composition from the first MLP as the input.

From a practical standpoint, our goal is to predict an individual’s metabolite response (i.e., the change in concentrations of the targeted metabolite) to a potential dietary intervention to facilitate precision nutrition. To achieve this goal, we feed the baseline microbiota and metabolome profiles of this individual and the potential dietary intervention strategy to a well-trained McMLP to predict the endpoint metabolome profile. Note that in this application (or test) stage, because the dietary intervention is a thought experiment, no real endpoint data is available. The first MLP in McMLP will predict the endpoint microbiota profile, which will be fed into the second MLP to predict the endpoint metabolome profile.

During the training stage of McMLP, we need to collect not only baseline microbiota and metabolome profiles of different individuals, but also perform dietary interventions to collect actual endpoint microbiota and metabolome profiles. We emphasize that the actual endpoint microbiota data will only be used to train the first MLP (Fig. 2b). It shall not be used to train the second MLP. This is because we need to keep the consistency between the training and test stages. After all, during the application stage, it is the predicted endpoint microbiome profile that will be fed into the second MLP, and the actual endpoint microbiome profile does not exist at all.
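The training/test consistency described above can be illustrated with a minimal sketch. It is not the McMLP implementation: a 1-nearest-neighbour regressor stands in for each MLP so that the example stays dependency-free, and all function and argument names (`fit_nearest_neighbour`, `train_two_step`, and so on) are hypothetical. The only point it illustrates is that the second model is fitted on the first model's predictions, not on the measured endpoint microbiome.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Model = Callable[[Vector], Vector]


def fit_nearest_neighbour(inputs: Sequence[Vector], targets: Sequence[Vector]) -> Model:
    """Stand-in regressor (NOT an MLP): return the target of the closest training input."""
    pairs = [(list(x), list(y)) for x, y in zip(inputs, targets)]

    def predict(x: Vector) -> Vector:
        def squared_distance(pair):
            return sum((a - b) ** 2 for a, b in zip(pair[0], x))

        return min(pairs, key=squared_distance)[1]

    return predict


def train_two_step(base_microbiome, base_metabolome, diet, end_microbiome, end_metabolome):
    """Fit step 1 on measured endpoint microbiomes, then step 2 on step-1 predictions."""
    step1_inputs = [m + b + d for m, b, d in zip(base_microbiome, base_metabolome, diet)]
    step1 = fit_nearest_neighbour(step1_inputs, end_microbiome)

    # Step 2 only ever sees the *predicted* endpoint microbiome, as it will at test time.
    predicted_end_microbiome = [step1(x) for x in step1_inputs]
    step2_inputs = [p + b + d for p, b, d in zip(predicted_end_microbiome, base_metabolome, diet)]
    step2 = fit_nearest_neighbour(step2_inputs, end_metabolome)

    def predict(microbiome: Vector, metabolome: Vector, diet_vector: Vector) -> Vector:
        predicted_microbiome = step1(microbiome + metabolome + diet_vector)
        return step2(predicted_microbiome + metabolome + diet_vector)

    return predict
```

At application time only baseline data and the planned intervention are passed to `predict`; no endpoint measurement enters either step.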

Instead of fine-tuning hyperparameters such as the number of layers \({N}_{{{{\rm{l}}}}}\) and the hidden layer dimension \({N}_{{{{\rm{h}}}}}\) for MLP, we overparameterized MLP by using a large and fixed number of layers \({N}_{{{{\rm{l}}}}}\) and hidden layer dimension \({N}_{{{{\rm{h}}}}}\) (\({N}_{{{{\rm{l}}}}}=6\) and \({N}_{{{{\rm{h}}}}}=2048\)). Overparameterized machine learning methods, especially deep learning models, often yield better performance owing to their high capacity (i.e., more model parameters). In fact, high-capacity models can produce smoother, and in that sense simpler, function approximations and are thus less likely to overfit43.
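To give a sense of what "overparameterized" means here, the following sketch counts the weights and biases of a fully connected network. It assumes that \({N}_{{{{\rm{l}}}}}\) counts hidden layers and that every layer carries a bias term; neither detail is stated in this section, and the function name is hypothetical.

```python
def mlp_parameter_count(n_inputs: int, n_outputs: int,
                        n_hidden_layers: int = 6, hidden_dim: int = 2048) -> int:
    """Weights plus biases of a fully connected network (assumed layout)."""
    layer_sizes = [n_inputs] + [hidden_dim] * n_hidden_layers + [n_outputs]
    # Each layer contributes fan_in * fan_out weights and fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))
```

Under these assumptions the hidden-to-hidden connections alone contribute roughly twenty million parameters, far more than the number of samples in a typical dietary intervention study.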

To illustrate the prediction task, we used a hypothetical example comprising \({N}_{{{{\rm{s}}}}}(=5)\) microbial species, \({N}_{{{{\rm{d}}}}}(=3)\) dietary resources being intervened, \({N}_{{{{\rm{m}}}}}(=6)\) metabolites, and 7 samples (Fig. 2b, c). We used both the baseline data and the dietary intervention strategy as inputs for McMLP (Fig. 2a). We used the Centered Log-Ratio (CLR)-transformed microbial relative abundances as the microbial composition and log10-transformed metabolite concentrations as the metabolomic profile. We did not impose the constraint that the predicted relative abundances from the first MLP add up to one. The value of the dietary intervention strategy is either binary, denoting the presence/absence of each dietary resource, or numeric and proportional to the intervention dose. Five samples are used as the training set (Fig. 2b) and the remaining two samples form the test set (Fig. 2c). To evaluate the regression performance, we employed three metrics based on the Spearman correlation coefficient (SCC) \(\rho\) between the predicted and true values of the concentration of one metabolite across all samples: (1) \(\bar{\rho }\): the mean SCC, (2) \({f}_{\rho > 0.5}\): the fraction of metabolites with \(\rho\) greater than 0.5, and (3) \({\bar{\rho }}_{5}\): the mean SCC of the top-5 best-predicted metabolites.
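The two input transformations mentioned above can be sketched as follows. The pseudocount used to avoid taking the logarithm of zero is our assumption; how zeros are handled is not stated in this section, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def clr_transform(relative_abundances: Sequence[float],
                  pseudocount: float = 1e-6) -> List[float]:
    """Centered log-ratio: log of each part minus the mean log over all parts."""
    logs = [math.log(value + pseudocount) for value in relative_abundances]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def log10_transform(concentrations: Sequence[float],
                    pseudocount: float = 0.0) -> List[float]:
    """Base-10 logarithm of metabolite concentrations (pseudocount is optional)."""
    return [math.log10(value + pseudocount) for value in concentrations]
```

By construction the CLR-transformed values of one sample sum to zero, and differences between two transformed values equal the log-ratio of the corresponding abundances.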

McMLP generates superior performance over existing methods on synthetic data

To validate the predictive power of McMLP, we applied it to synthetic data generated from the Microbial Consumer-Resource Model (MiCRM) which considers microbial interactions through both nutrient competition and metabolic cross-feeding44. We adapted MiCRM to simulate the dietary intervention. For simplicity, we considered 20 food resources, 20 microbes, and 20 metabolites in the modeling. Also, we assumed that food resources can only be consumed while metabolites can be either consumed or produced. Prior to the dietary intervention, one food resource (referred to as “food resource #1”) was not introduced, while the remaining 19 food resources were supplied. Dietary intervention was simulated by adding food resource #1 at a specific “dose” to microbial communities composed of surviving species before the dietary intervention and calculating the new ecological steady state. Here, the “dose” is defined as the ratio between the amount of the introduced food resource during the dietary intervention and the average amount of other food resources introduced before the dietary intervention. We split the synthetic data (with 250 samples) with 80/20 ratio fifty times to generate fifty train-test pairs that can be used to reflect the variation in predictive performance. Details on model simulation and synthetic data generation can be found in the Supplementary Information.
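The dose definition and the repeated 80/20 splitting described above can be sketched as follows. The function names, the fixed seed, and the use of Python's `random` module are illustrative assumptions, not the authors' code.

```python
import random
from typing import List, Sequence, Tuple


def intervention_dose(added_amount: float, baseline_amounts: Sequence[float]) -> float:
    """Dose = added amount / mean amount of the resources supplied before the intervention."""
    return added_amount / (sum(baseline_amounts) / len(baseline_amounts))


def repeated_splits(n_samples: int = 250, test_fraction: float = 0.2,
                    n_repeats: int = 50, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return n_repeats random (train_indices, test_indices) pairs."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_fraction)
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        # The first n_test shuffled indices form the test set, the rest the training set.
        splits.append((sorted(order[n_test:]), sorted(order[:n_test])))
    return splits
```

Because every split is drawn from the same 250 samples, training sets overlap across repeats, so the spread across splits reflects resampling variability within one dataset.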

We compared the performance of McMLP with two classical methods (GBR: Gradient-Boosting Regressor3; RF: Random Forest4,5) in the prediction task defined in Fig. 2. For each method, we considered two sets of input variables: (1) without baseline metabolomic profiles (denoted as “w/o b” hereafter) and (2) with baseline metabolomic profiles (denoted as “w/ b” hereafter).

We first used the three metrics (\(\bar{\rho }\), \({f}_{\rho > 0.5}\), \({\bar{\rho }}_{5}\)) to benchmark the predictive performance of the different methods on synthetic data with 50 training samples and an intervention dose of 3. We found that McMLP generated the best performance (Fig. 3a1-a3), especially when baseline metabolomic profiles were included in the input. When predicting without baseline metabolomic profiles, McMLP is significantly better than RF and GBR (p value < 0.05 for 5/6 comparison cases, Wilcoxon signed-rank test applied; McMLP yields the highest \(\bar{\rho }\) of \(0.391\pm 0.008\), the highest \({f}_{\rho > 0.5}\) of \(0.197\pm 0.018\), and the highest \({\bar{\rho }}_{5}\) of \(0.536\pm 0.007\); the standard error is used to measure the variation in performance metrics across 50 train-test splits). Including baseline metabolomic profiles in the input significantly improves the performance of all methods, with McMLP still being the best (it yields the highest \(\bar{\rho }\) of \(0.595\pm 0.005\), the highest \({f}_{\rho > 0.5}\) of \(0.815\pm 0.014\), and the highest \({\bar{\rho }}_{5}\) of \(0.715\pm 0.006\)). We also introduced 5 food resources during the dietary intervention (instead of 1 previously; see Supplementary Information for details) and found that the performance of McMLP is still superior to that of the other methods when the dietary intervention strategy is more complex (Supplementary Fig. 2).
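For concreteness, the three metrics and the standard error across splits can be computed as in the sketch below. This is a plain-Python illustration with hypothetical function names; returning 0 for a constant vector is our convention, since the Spearman correlation is undefined in that case.

```python
import math
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks; 0.0 if either vector is constant (our convention)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def summarize(true_by_metabolite, pred_by_metabolite,
              top_k: int = 5, cutoff: float = 0.5) -> Dict[str, float]:
    """One SCC per metabolite (across samples), then the three summary metrics."""
    rhos = [spearman(t, p) for t, p in zip(true_by_metabolite, pred_by_metabolite)]
    top = sorted(rhos, reverse=True)[:top_k]
    return {
        "mean_rho": sum(rhos) / len(rhos),
        "frac_above_cutoff": sum(r > cutoff for r in rhos) / len(rhos),
        "mean_top_k": sum(top) / len(top),
    }


def standard_error(values: Sequence[float]) -> float:
    """Sample standard deviation divided by sqrt(n), e.g. over the 50 splits."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)
```

Each entry of `true_by_metabolite` holds the values of one metabolite across all test samples, so the correlation is taken across samples, not across metabolites.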

Fig. 3: McMLP provides better predictive power than previously developed computational methods for predicting endpoint metabolomic profiles on synthetic data generated from microbial consumer-resource models.

Three computational methods are compared: Random Forest (RF), Gradient Boosting Regressor (GBR), and McMLP. For each method, we either included (“w/ b” label) or did not include (“w/o b” label) baseline metabolomic profiles as input variables. Each method with a particular combination of input data is colored the same way in all panels. Standard errors are computed based on fifty random train-test splits and shown in all panels (as solid black vertical lines or transparent areas around their means). To compare different methods, we adopted three metrics: the mean Spearman Correlation Coefficient (SCC) \(\bar{\rho }\), the fraction of metabolites with SCCs greater than 0.5 (denoted as \({f}_{\rho > 0.5}\)), and the mean SCC of the top-5 predicted metabolites \({\bar{\rho }}_{5}\). Error bars denote the standard error (n = 50). a1-a3, For the synthetic data with an intervention dose of 3 and 50 training samples, McMLP provides the best performance for all three metrics regardless of whether the baseline metabolomic profiles are included or not. b1-b3, When the intervention dose is 3, the predictive performance of all methods gets better and closer to each other as the training sample size increases. Including baseline metabolomic profiles also helps to improve the prediction. c1-c3, When 200 training samples are used, the performance gap between including and not including baseline metabolomic profiles shrinks as the intervention dose increases. All statistical analyses were performed using the two-sided Wilcoxon signed-rank test. P values obtained from the test are divided into four groups: (1) \(p > 0.05({{{\rm{n}}}}.{{{\rm{s}}}}.)\), (2) \(0.01 < p\le 0.05(*)\), (3) \({10}^{-3} < p\le 0.01(*\ast )\), and (4) \({10}^{-4} < p\le {10}^{-3}(*\ast*)\). Source data of raw data points and p values are provided as a Source Data file.

We further examined the effect of training sample size on model performance. While maintaining the same 50-sample test set used previously, we found that all performance metrics for all methods improved as the training sample size increased (Fig. 3b1–b3). More importantly, we found that McMLP outperforms RF and GBR at small training sample sizes (20 or 50) and performs similarly to RF and GBR at larger training sample sizes (>50). This demonstrates the superior performance of McMLP with a limited number of samples, contrary to the traditional notion that deep learning methods tend to overfit at small sample sizes45.

We also examined the effect of intervention dose on model performance. By varying the concentration of the intervened food resource in MiCRM, we generated synthetic data with different intervention doses and subsequently trained all ML methods on them with 200 training samples. We found that the performance gap between methods using and not using baseline metabolomic profiles narrows as the intervention dose increases (Fig. 3c1–c3). We believe this is because a larger intervention dose significantly changes the endpoint metabolomic profile away from its baseline level, rendering the baseline metabolomic profile less useful.

In the benchmarking above, training data overlapped across train-test splits. We therefore explored the impact of non-overlapping training data on our benchmarking results. To do so, we created one independent synthetic dataset for each training run and used the same separate dataset (with 100 samples) as the test set for performance evaluation across all repeats. Under this new benchmarking protocol, we benchmarked all algorithms again and once more observed the strong predictive performance of McMLP (Supplementary Fig. 3).

McMLP accurately predicts metabolite responses on real human gut microbiota data

After validating McMLP using synthetic data, we analyzed real data from six dietary intervention studies to see if its performance on real data was consistently better than existing methods. The first dataset we collected was from a study investigating how avocado consumption alters gut microbial compositions and concentrations of fecal metabolites such as SCFAs and bile acids28. In this study all participants were divided into two groups based on the food components of the meals provided: (1) avocado group: 175 g (men) or 140 g (women) of avocado was provided as part of a meal once a day for 12 weeks and (2) control group: no avocado was included in their control meal28. Baseline (i.e., before the dietary intervention) and endpoint (i.e., during week 12 of the intervention) microbial compositions and concentrations of SCFAs and bile acids were quantified. The dataset is unique due to its relatively large sample size (66 for both avocado and control groups)28 compared to other dietary intervention studies27,32,34.

Because the amount of avocado consumed by participants in the avocado group was very similar and participants in the control group barely consumed avocado, for simplicity, we encoded the participant’s dietary intervention in McMLP and other methods as a binary variable in the input (green icons/symbols representing diets in Fig. 2) whose value equals 1 or 0 if the participant is in the avocado or control group, respectively. Note that in this study the concentrations of fecal SCFAs and bile acids were obtained from two separate targeted metabolomic assays. Hence, we separated the concentration prediction of SCFAs and bile acids to compare the predictability of the two metabolite classes. We found that for the concentration prediction of both SCFAs and bile acids, McMLP with the baseline metabolomic profiles consistently produces the best performance (Fig. 4a1-a3, b1-b3). Interestingly, the inclusion of baseline metabolomic profiles in the input of McMLP helps more with the prediction of bile acid concentrations than with the prediction of SCFA concentrations (\(\bar{\rho }\) increases from 0.182 to 0.346 for bile acids when metabolomic profiles are included; \(\bar{\rho }\) increases from 0.260 to 0.262 for SCFAs when metabolomic profiles are included). A potential explanation is that the correlation of SCFA concentrations between baseline and endpoint samples is weaker than that of bile acids (Supplementary Fig. 4).

Fig. 4: McMLP is superior to previous methods in terms of predicting endpoint metabolomic profiles on real data from six dietary intervention studies.

Three computational methods are compared: Random Forest (RF), Gradient Boosting Regressor (GBR), and McMLP. For each method, we either included (“w/ b” label) or did not include (“w/o b” label) baseline metabolomic profiles as input variables. Each method with a particular combination of input data is colored the same in all panels. Standard errors are computed based on fifty random train-test splits and shown in all panels (solid black vertical lines). To compare different methods, we adopted three metrics: the mean Spearman Correlation Coefficient (SCC) \(\bar{\rho }\), the fraction of metabolites with SCCs greater than 0.5 (denoted as \({f}_{\rho > 0.5}\)), and the mean SCC of the top-5 predicted metabolites \({\bar{\rho }}_{5}\). Error bars denote the standard error (n = 50). a1-a3, Comparison of the performance in predicting SCFAs on the data from the avocado intervention study28. b1-b3, Comparison of performance in predicting bile acids on the data from the avocado intervention study28. c1-c3, Comparison of predictive performance on the data from the grain intervention study39. d1-d3, Comparison of predictive performance on the data from the walnut intervention study27. e1-e3, Comparison of predictive performance on the data from the almond intervention study40. f1-f3, Comparison of predictive performance on the data from the broccoli intervention study41. g1-g3, Comparison of predictive performance on the data from the high-fiber food or fermented food intervention study34. All statistical analyses were performed using the two-sided Wilcoxon signed-rank test. P values obtained from the test are divided into four groups: (1) \(p > 0.05\) (n.s.), (2) \(0.01 < p\le 0.05\) (*), (3) \({10}^{-3} < p\le 0.01\) (**), and (4) \({10}^{-4} < p\le {10}^{-3}\) (***). Source data of raw data points and p values are provided as a Source Data file.

We checked the predictive performance of the one-step strategy that uses the same number of layers and nodes as one step in McMLP (\({N}_{{{{\rm{l}}}}}=6\) and \({N}_{{{{\rm{h}}}}}=2048\) in Supplementary Fig. 1b), finding that it is not as good as that of McMLP (Supplementary Fig. 5). It is worth noting that augmenting the one-step approach with additional data types through the two-step McMLP does not automatically guarantee enhanced predictive performance. The utility of the additional data hinges on its relevance and the model’s capacity to utilize it efficiently. Despite these potential uncertainties, we believe the enhanced performance of McMLP could be attributed to its two-step approach. This method allows for an initial capture of the endpoint microbial composition, presumably better associated with the endpoint metabolite concentrations. This may also explain why McMLP outperforms RF4,5 and GBR3, which employ a one-step approach and do not leverage the endpoint microbial compositions during method training. We also compared McMLP with mNODE38, the state-of-the-art method for predicting metabolomic profiles from microbial compositions measured at the same time, finding that mNODE performs worse than McMLP (Supplementary Fig. 6). The worse performance of mNODE is likely because it was not designed to predict metabolomic profiles at a different time point. More technical reasons can be found in the Supplementary Information.

We extended the method comparison to five additional datasets from independent dietary studies investigating how microbiota compositions and fecal metabolite concentrations were influenced by adding grains39, walnuts27, almonds40, broccoli41, and high-fiber or fermented foods34 (the number of fecal microbes and metabolites as well as the types of metabolites are summarized in Table 1; see Methods section for details of the studies). Each participant’s dietary intake was similarly encoded as either a binary variable or a vector whose value is proportional to the consumed amount of the added dietary component, depending on the complexity of the dietary intervention. Further details of the data processing and model architecture setup can be found in the Supplementary Information. As shown in Fig. 4, McMLP consistently produces the best performance across all datasets (p value < 0.05 for 47/84 comparison cases, Wilcoxon signed-rank test applied). The relatively poor performance of all methods on the data from the study that investigated fibers and fermented foods34 is likely due to the fact that a variety of foods within the fiber and fermented foods categories were consumed by the participants at will, while other studies were complete feeding trials34.

Table 1 Summary of key features of dietary intervention studies used in our method comparison. ASVs: Amplicon Sequence Variants

We noticed that the predictive performance of McMLP on real data is worse than that on synthetic data. We believe this discrepancy may be due to the influence of the human host, such as host metabolism46 and health status47. While \(\bar{\rho }\) appears to be low (~0.2 to 0.4), the top-5 best-predicted metabolites for each dataset are predicted considerably better, likely due to their strong association with the gut microbiome (Supplementary Fig. 7). We also compared the predictive performance of McMLP with that of a simple MLP with one hidden layer and everything else the same as in McMLP, finding that McMLP generates better performance (Supplementary Fig. 8).

We also explored whether incorporating covariates in the metadata can help further improve the predictive performance of McMLP. We only obtained the covariates for the avocado intervention study. For the avocado dataset, we have three covariates: gender, BMI, and age. We included these three covariates as additional variables in McMLP, finding that the incorporation of covariates significantly improves the predictive performance for most cases (Fig. 5). We also analyzed the permutation feature importance of the three covariates by shuffling the values of a covariate in the input and then measuring the reduction in the average Spearman Correlation Coefficients \(\bar{\rho }\). We found that all three covariates are important, except that gender is slightly less important than age when predicting the SCFAs (Supplementary Fig. 9).
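The permutation feature importance described above can be sketched generically. Here `predict` and `score` are placeholders for a trained model and a performance metric (in the text, the metric is \(\bar{\rho }\)); the function name, the number of repeats, and the seed are assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence


def permutation_importance(predict: Callable, score: Callable,
                           rows: Sequence[Sequence[float]], targets: Sequence,
                           column: int, n_repeats: int = 20, seed: int = 0) -> float:
    """Mean drop in score after shuffling one input column across samples."""
    rng = random.Random(seed)
    baseline = score(targets, [predict(row) for row in rows])
    drops: List[float] = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Replace only the chosen column; all other inputs stay attached to their sample.
        permuted_rows = [list(row[:column]) + [value] + list(row[column + 1:])
                         for row, value in zip(rows, shuffled)]
        drops.append(baseline - score(targets, [predict(row) for row in permuted_rows]))
    return sum(drops) / len(drops)
```

A covariate that the model ignores yields an importance of zero, while a covariate the model relies on yields a positive drop in the score.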

Fig. 5: Including the covariates in metadata (age, BMI, and gender) in the input of McMLP improves it in terms of predicting endpoint metabolomic profiles on real data from the avocado intervention study.

All results are derived from McMLP. We either included (“w/ b” label) or did not include (“w/o b” label) baseline metabolomic profiles as input variables. Each method with a particular combination of input data is colored the same in all panels. Standard errors are computed based on fifty random train-test splits and shown in all panels (solid black vertical lines). To compare different methods, we adopted three metrics: the mean Spearman Correlation Coefficient (SCC) \(\bar{\rho }\), the fraction of metabolites with SCCs greater than 0.5 (denoted as \({f}_{\rho > 0.5}\)), and the mean SCC of the top-5 predicted metabolites \({\bar{\rho }}_{5}\). Error bars denote the standard error (n = 50). a1-a3, Comparison of the performance in predicting SCFAs on the data from the avocado intervention study28. b1-b3, Comparison of performance in predicting bile acids on the data from the avocado intervention study28. All statistical analyses were performed using the two-sided Wilcoxon signed-rank test. P values obtained from the test are divided into four groups: (1) \(p > 0.05\) (n.s.), (2) \(0.01 < p\le 0.05\) (*), (3) \({10}^{-3} < p\le 0.01\) (**), and (4) \({10}^{-4} < p\le {10}^{-3}\) (***). Source data of raw data points and p values are provided as a Source Data file.

We wondered whether the predictive performance of McMLP could be enhanced by using the functional profiles generated from whole-metagenome shotgun (WMS) sequencing instead of the microbial compositions derived from 16S rRNA gene sequencing. To test this, we leveraged the available WMS sequencing data for a subset of samples in the avocado study. In the end, only 45 individuals had paired baseline-endpoint data. Their functional profiles are represented by 375 pathway features (see Methods section for details). For these 45 paired baseline-endpoint samples, we compared the predictive performance among three different input data types: (1) microbial compositions, (2) functional profiles, and (3) both microbial compositions and functional profiles combined. The performance comparison of the three input data types yields no significant difference (Supplementary Figs. 10, 11).

For the avocado dataset, we also grouped the ASV (Amplicon Sequence Variants) compositions from the 16S rRNA gene sequencing and the species-level microbial compositions from the WMS sequencing to the genus level. When analyzing the 16S sequencing data, predictions using the ASV-level compositions are generally more accurate than those using the genus-level compositions (Supplementary Fig. 12). For SCFAs, the predictive performances based on two types of compositions are comparable. Regarding the WMS data, we observed that predictions using the species-level compositions are slightly better than those using the genus-level compositions (Supplementary Fig. 13).
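Grouping finer-resolution features to the genus level amounts to summing abundances that share a genus label. The sketch below assumes a taxonomy lookup table (`genus_of`, hypothetical) and that the aggregation is applied to relative abundances before any CLR transformation; neither detail is specified in this section.

```python
from collections import defaultdict
from typing import Dict, Mapping


def aggregate_to_genus(abundances: Mapping[str, float],
                       genus_of: Mapping[str, str]) -> Dict[str, float]:
    """Sum the relative abundances of all taxa (ASVs or species) sharing a genus."""
    totals: Dict[str, float] = defaultdict(float)
    for taxon, value in abundances.items():
        # Taxa without a genus assignment are pooled under one label (our choice).
        totals[genus_of.get(taxon, "unclassified")] += value
    return dict(totals)
```

The same function applies to ASV-level 16S data and to species-level WMS data, since only the lookup table differs.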

Inferring the tripartite food-microbe-metabolite relationship

It has been previously shown that an individual’s metabolite response depends on her/his gut microbial composition7,42,48. If we want to introduce a new dietary resource to boost the concentration of a health-beneficial metabolite mediated by gut microbes, we need to target “key” microbial species that meet two criteria: (1) the species can consume one or more dietary components in the introduced food resource; (2) the species can increase the metabolite concentration. If either criterion is not met, it is difficult to boost the metabolite concentration via this dietary intervention. Specifically, we identify these “key” species that satisfy both criteria by revealing the food-microbe consumption and microbe-metabolite production patterns, which can be summarized in a tripartite food-microbe-metabolite graph (Supplementary Fig. 14). To achieve this, we performed the sensitivity analysis of McMLP. In particular, we interpreted a potential relationship between an input variable \(x\) and an output variable \(y\) by perturbing \(x\) by a small amount (denoted as \(\Delta x\)) and then measuring the response of \(y\) (denoted as \(\Delta y\)). Following the notion of sensitivity in engineering sciences, we defined sensitivity \(s=\frac{\Delta y}{\Delta x}\) and used its sign (positive/negative) to reflect whether \(y\) changes in the same/opposite direction as \(x\). More technical details of this calculation can be found in the Methods section or in our previous study38.

We calculated sensitivities for step-1 and step-2 in McMLP to infer potential food-microbe consumption and microbe-metabolite production interactions, respectively (Fig. 6a). Specifically, in step-1, we perturbed the amount of food resource \(\alpha\) and measured the change in the relative abundance of species \(i\). The sensitivity of species \(i\) to food resource \(\alpha\) is \({{{{\rm{s}}}}}_{i\alpha }=\frac{\Delta {y}_{i}}{\Delta {x}_{\alpha }}\) and its sign can be used to reflect the interaction between species \(i\) and food resource \(\alpha\). A positive sensitivity, \({{{{\rm{s}}}}}_{i\alpha } > 0\), indicates that species \(i\) can consume some nutrient components of food resource \(\alpha\). Similarly, for step-2, we define the sensitivity of metabolite \(\beta\) to species \(i\) as \({{{{\rm{s}}}}}_{\beta i}=\frac{\Delta {y}_{\beta }}{\Delta {x}_{i}}\). A positive sensitivity, \({{{{\rm{s}}}}}_{\beta i} > 0\), reveals potential production of metabolite \(\beta\) by species \(i\).
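The sensitivity definition above corresponds to a one-sided finite difference. The sketch below evaluates it for an arbitrary callable `model` at a single input point; the perturbation size and the function name are assumptions, and any aggregation across samples described in the Methods is not reproduced here.

```python
from typing import Callable, List, Sequence


def sensitivity_matrix(model: Callable[[List[float]], Sequence[float]],
                       x: Sequence[float], delta: float = 1e-3) -> List[List[float]]:
    """s[j][k] = (y_j(x + delta * e_k) - y_j(x)) / delta for every output j and input k."""
    base = model(list(x))
    matrix = [[0.0] * len(x) for _ in base]
    for k in range(len(x)):
        perturbed = list(x)
        perturbed[k] += delta  # perturb one input, keep all others fixed
        shifted = model(perturbed)
        for j in range(len(base)):
            matrix[j][k] = (shifted[j] - base[j]) / delta
    return matrix
```

Applied to step-1, the inputs of interest are the dietary entries and the outputs are species abundances; applied to step-2, the inputs of interest are species abundances and the outputs are metabolite concentrations. Only the sign of each entry is needed to read off a putative consumption or production link.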

Fig. 6: Applying sensitivity analysis of McMLP accurately infers food-microbe consumption interactions and microbe-metabolite production interactions in both synthetic and real data.

a The sensitivity of the relative abundance of species \(i\) to the supplied dietary resource \(\alpha\) is denoted as \({{{{\rm{s}}}}}_{i\alpha }\). It is defined as the ratio between the change in the relative abundance of species \(i\) (\({\Delta y}_{i}\)) and a small perturbation in the supplied dietary resource \(\alpha\) (\({\Delta x}_{\alpha }\)). Similarly, the sensitivity of the concentration of metabolite \(\beta\) to the relative abundance of species \(i\) is denoted as \({{{{\rm{s}}}}}_{\beta i}\). It is defined as the ratio between the change in the concentration of metabolite \(\beta\) (\({\Delta y}_{\beta }\)) and the perturbation in the relative abundance of species \(i\) (\({\Delta x}_{i}\)). b The sensitivity values for food-microbe consumption interactions (colored in green) and microbe-metabolite production interactions (colored in red) in the synthetic data. c The ground-truth food-microbe consumption rates (colored in green) and microbe-metabolite production rates (colored in red) in the synthetic data. d The Area Under the Receiver Operating Characteristic (AUROC) curve based on True Positive (TP) rates and False Positive (FP) rates which are obtained by using different sensitivity thresholds to classify interactions. e The sensitivity values for avocado-microbe consumption interactions (colored in green) and microbe-metabolite production interactions (colored in red) for the real data from the avocado intervention study. f The avocado-microbe-butyrate tripartite graph constructed based on the sensitivity values of avocado-microbe consumption interactions and microbe-butyrate production interactions for the real data from the avocado intervention study. The edge width and edge arrow sizes are proportional to the absolute values of the sensitivities. All microbes in the middle layer are arranged from left to right in the increasing order of the incoming edge width multiplied by the outgoing edge width. Source data are provided as a Source Data file.

We first evaluated our sensitivity method on the synthetic data for which we know the ground truth of food-microbe consumption and microbe-metabolite production interactions. We found that the inferred sensitivity values for all food-microbe and microbe-metabolite pairs (Fig. 6b) have a zero-nonzero pattern very similar to the ground-truth consumption and production rates assigned in MiCRM (Fig. 6c). We chose zero as the sensitivity threshold and kept only positive values for food-microbe pairs (green cells in Fig. 6b, c) and for microbe-metabolite pairs (red cells in Fig. 6b, c) to explore consumption and production interactions respectively. To statistically verify the agreement between ground-truth interactions and inferred interactions based on sensitivity values, we computed the AUROC (Area Under the Receiver Operating Characteristic curve) based on the overlap between true and predicted interactions when the classification threshold is varied. More specifically, for each classification threshold \({s}_{{\mbox{thres}}}\), we predicted the consumption of food resource \(\alpha\) by species \(i\) (or production of metabolite \(\beta\) by species \(i\)) to be true only if \({{{{\rm{s}}}}}_{i\alpha } > {s}_{{\mbox{thres}}}\) (or \({{{{\rm{s}}}}}_{\beta i} > {s}_{{\mbox{thres}}}\)). We achieved excellent performance in inferring both food-microbe consumption interactions (green line and dots with AUROC = 0.9 in Fig. 6d) and microbe-metabolite production interactions (red line and dots with AUROC = 0.92 in Fig. 6d).
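The threshold sweep behind the AUROC can be sketched as follows. We assume here that a ground-truth interaction is labeled 1 when the corresponding MiCRM rate is nonzero and 0 otherwise; the function names are illustrative.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(FPR, TPR) pairs obtained by lowering the threshold through the observed scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        # Using >= at an observed score equals using > at a threshold just below it.
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / negatives, tp / positives))
    return points


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Trapezoidal area under the ROC curve."""
    points = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Here `scores` would be the flattened sensitivity values of all food-microbe (or microbe-metabolite) pairs and `labels` the flattened ground-truth indicators in the same order.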

We then performed the same inference on real data from the avocado study28. The results are shown in Fig. 6e (inference results for the other studies are provided in the Supplementary Data). These results agree with prior biological knowledge that Faecalibacterium prausnitzii is a stronger producer of butyrate49 than Ruminococcus callidus, and that R. callidus is a stronger producer of acetate than F. prausnitzii50,51.

The inference results also enable us to construct the tripartite food-microbe-metabolite graph. For the sake of simplicity, here we visualize the avocado-microbe-butyrate subgraph (Fig. 6f). Note that increased butyrate levels have been shown to be beneficial to host health by enhancing immune status19,20,21. For the avocado-microbe-butyrate subgraph, we focused on the top-20 avocado-microbe consumption and top-20 microbe-butyrate production interactions ranked by their absolute sensitivity values. Only nodes and links associated with these interactions are shown in this subgraph. The widths of individual edges in this figure are proportional to the absolute values of the corresponding sensitivities, and the node size of each microbe is proportional to the product of the widths of the edges connecting this microbe to avocado at the top and to butyrate at the bottom of the subgraph. We ordered microbial nodes in the middle layer in increasing order of node size from left to right (Fig. 6f). This organization helps us identify the key species that serve as both strong consumers of avocado and strong producers of butyrate. F. prausnitzii emerged as the most important key species for butyrate production in response to avocado intervention. Our results are consistent with previous studies49. For example, F. prausnitzii levels have been previously shown to be elevated when avocado is supplied in the diet52. In a separate study, F. prausnitzii has also been shown to produce butyrate as a metabolic byproduct49.
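The node ordering used for the subgraph can be sketched as below. The sketch takes two hypothetical dictionaries of sensitivities keyed by microbe name, keeps the top-ranked interactions of each kind by absolute value, and orders microbes by the product of the two edge widths. Treating an edge that falls outside the retained set as having zero width is our assumption.

```python
from typing import Dict, List, Mapping, Tuple


def rank_key_species(food_to_microbe: Mapping[str, float],
                     microbe_to_metabolite: Mapping[str, float],
                     top_n: int = 20) -> List[Tuple[str, float]]:
    """Return (microbe, node_size) pairs in increasing order of node size."""

    def top_edges(sensitivities: Mapping[str, float]) -> Dict[str, float]:
        kept = sorted(sensitivities, key=lambda name: abs(sensitivities[name]),
                      reverse=True)[:top_n]
        return {name: abs(sensitivities[name]) for name in kept}

    incoming = top_edges(food_to_microbe)         # food -> microbe edge widths
    outgoing = top_edges(microbe_to_metabolite)   # microbe -> metabolite edge widths
    sizes = {name: incoming.get(name, 0.0) * outgoing.get(name, 0.0)
             for name in set(incoming) | set(outgoing)}
    # Sort by size, then by name so that ties are ordered deterministically.
    return sorted(sizes.items(), key=lambda item: (item[1], item[0]))
```

The last element of the returned list corresponds to the rightmost node in the layout, that is, the candidate key species with both a strong incoming and a strong outgoing edge.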
