You can run and edit these examples interactively on Galaxy
Comparative metagenomics
Normalization methods, alpha & beta diversity, and differentially abundant taxa
In this notebook we aim to demonstrate how the MGnifyR tool can be used to fetch data and metadata of a MGnify metagenomic analyisis. Then we show how to generate diversity metrics for comparative metagenomics using taxonomic profiles. Finally, we will use SIAMCAT to detect differentially abundant taxa and to build a ML classification model.
MGnifyR is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the MGnify API. The benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.
This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter
MGnifyR is an R package that provides a convenient way for R users to access data from the MGnify API.
Detailed help for each function is available in R using the standard ?function_name command (i.e. typing ?mgnify_query will bring up built-in help for the mgnify_query command).
A vignette is available containing a reasonably verbose overview of the main functionality. This can be read either within R with the vignette("MGnifyR") command, or in the development repository
MGnifyR Command cheat sheet
The following list of key functions should give a starting point for finding relevent documentation.
mgnify_client() : Create the client object required for all other functions.
mgnify_query() : Search the whole MGnify database.
mgnify_analyses_from_xxx() : Convert xxx accessions to analyses accessions. xxx is either samples or studies.
mgnify_get_analyses_metadata() : Retrieve all study, sample and analysis metadata for given analyses.
mgnify_get_analyses_phyloseq() : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
mgnify_get_analyses_results() : Get functional annotation results for a set of analyses.
mgnify_download() : Download raw results files from MGnify.
mgnify_retrieve_json() : Low level API access helper function.
# Setting tables and figures size to display (these will be reset later):options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)options(repr.plot.width=4, repr.plot.height=4)
Tip: In case you are interested in running the comparative metagenomic analysis using data from different studies in MGnify, you can adapt the following commands:
Check the sample_environment-feature to discover what kind of samples are part of the study and how many of each exists.
Note: For a comparative study, we need at least 5 samples per group
table(v5_metadata$'sample_environment-feature')
deep chlorophyll maximum layer (ENVO:xxxxxxxx)
76
marine epipelagic mixed layer (ENVO:xxxxxxxxx)
9
mesopelagic zone (ENVO:00000213)
43
mesopelagic zone (ENVO:00000213) & marine oxygen minimum zone (ENVO:01000065)
13
surface water (ENVO:00002042) layer
6
surface water layer (ENVO:00002042)
100
water layer with no specific feature
1
Filtering out noisy samples.
We want to create a new dataframe containing the relevant samples. In this step, we will also create a clean label for the environment feature. Additionally, we noticed that some samples have extreme non-sense values in the metadata. We will filter out those samples to reduce the noise on the analysis.
Note: To speed up the following analysis, we are going to keep only 34 samples per group (by randomly subsampling the table)
# Saving the list of Surface water samples in a dataframesub1 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "surface"), ]# Reducing the metadata table to keep relevant fields onlyvariables_to_keep =c('sample_temperature','sample_depth','sample_salinity','sample_chlorophyll sensor','sample_nitrate sensor','sample_oxygen sensor')reduced_sub1 = sub1[variables_to_keep]# Removing rows with extreme valuesreduced_sub1[reduced_sub1 =="99999"] <-NAreduced_sub1[reduced_sub1 =="99999.0"] <-NAreduced_sub1[reduced_sub1 =="0"] <-NAclean1=na.omit(reduced_sub1)# Subsampling to 34set.seed(345)random_1 = clean1[sample(nrow(clean1), 34), ]random_1$'env_label'=c(rep('Surface', times=length(rownames(random_1))))
# Saving the list of Mesopelagic zone samples in a dataframesub2 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "mesopelagic"), ]# Reducing the metadata table to keep relevant fields onlyreduced_sub2 = sub2[variables_to_keep]# Removing rows with extreme valuesreduced_sub2[reduced_sub2 =="99999"] <-NAreduced_sub2[reduced_sub2 =="99999.0"] <-NAreduced_sub2[reduced_sub2 =="0"] <-NAclean2=na.omit(reduced_sub2)# Subsampling to 34set.seed(345)random_2 = clean2[sample(nrow(clean2), 34), ]random_2$'env_label'=c(rep('Mesopelagic', times=length(rownames(random_2))))
Keep only relevant columns in the phyloseq metadata table and transform numeric variables from characters to numbers. Add the environment label as well.
# Keeping relevant metadata in the phyloseq objectvariables_to_keep =c('sample_temperature','sample_depth','sample_salinity','sample_chlorophyll.sensor','sample_nitrate.sensor','sample_oxygen.sensor')df =data.frame(sample_data(ps))[variables_to_keep]# Transforming character to nummeric variablesdf[] =lapply(df, function(x) as.numeric(as.character(x)))sample_data(ps) = df# Adding the env label sample_data(ps)$'env_label'= clean_metadata$env_label
Part 2. Normalization, alpha diversity indices and taxonomic profiles visualization
We can see that samples with number of reads \(\leq 10 ^ {1.5}\) (i.e. \(\lesssim 32\)) seem to be outliers. Let’s filter out the outliers and plot a new histogram.
Remove singletons. Singletons are OTUs of size one, meaning that only one read was assigned to that OTU. These very low-abundance OTUs could be biologically real (belonging to the rare biosphere (1)), or they could be false positives due to sequencing artefacts. Singletons observed in only one sample are more likely to be artefacts, and it is good practice to remove them from the OTUs counts table to avoid artificially over-estimating the OTUs richness. You can find more discussion about this in Robert Edgar’s blog.
Show some stats on the sequencing depth across samples.
max_difference =max(sample_sums(ps_final))/min(sample_sums(ps_final))sprintf("The max difference in sequencing depth is %s", max_difference)options(repr.plot.width=4, repr.plot.height=5)boxplot(sample_sums(ps_final), main="Sequencing depth across samples", xlab="", ylab="Number of reads", col="#a6093d")text(y=boxplot.stats(sample_sums(ps_final))$stats, labels=boxplot.stats(sample_sums(ps_final))$stats, x=1.25)
'The max difference in sequencing depth is 8.92857142857143'
An approximately 10-fold difference in the library sizes means that we will need to apply a normalization method before continuing with the analysis. The most common normalization methods used in microbiome count data are proportions and rarefaction. However, other methods originally developed to normalize RNA-seq counts have been adapted to differential-abundance analysis in microbiome data. A discussion about how to choose the right normalization method is out of the scope of this material, but the topic has been covered in multiple forums and scientific publications. Depending on the downstream analysis we intend to perform, different methods might be appropriate. For instance, to compare groups of samples at community-level through beta-diversity, “…proportions and rarefying produced more accurate comparisons among communities and are the only methods that fully normalized read depths across samples. Additionally, upper quartile, cumulative sum scaling (CSS), edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed” (2). On the other hand, for detection of differentially abundant species, “both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes” (3).
In the following examples we will show three popular ways of normalization: relative abundance, rarefaction and cummulative sum scaling.
2.2. Normalization by total sum scaling (TSS, relative abundance or proportions)
Agglomerate taxonomy at Class rank and keep only the most abundant classes (threshold=1%, i.e. 0.01). In microbial data, we expect to observe abundance distributions with a long ‘tail’ of low-abundance organisms which often comprise the large majority of species. For this reason, once the matrix has been transformed to relative abundance, we will transform the taxonomic profile at a high taxonomic rank (Class), agglomerating the counts first and using an abundance threshold of 1% to avoid displaying too many unreadable categories in the plot.
By naked eye, we can see that the profiles looks quite different. Let’s take a deeper look at some interesting Classes to check how the abundance change depending on the normalization method.
For statistical analysis, adonis and betadisper methods are widely used together in β-diversity analyses.
Adonis tests whether two or more groups have different compositions. Adonis calculates the squared deviations of each site to the centroid and then, performs significance tests using F-tests based on sequential sums of squares from permutations of the raw data. It can be seen as an ANOVA using distance matrices (analogous to MANOVA). A non significant p-value (p>0.05), means that there’s no difference in composition between groups (null hypothesis of no difference in composition between groups).
Betadisper tests if two or more groups are homogeneously dispersed in relation to their species in studied samples. This test can be done to see if one group has more compositional variance than another. The method use the distance matrix to calculate the multivariate dispersions (variances; average distance to centroids). Then we use group dispersions to perform an ANOVA test. ANOVA’s p-value not significant (p>0.05) means that group dispersions are homogenous (null hypothesis of no difference in dispersion between groups).
Compute beta diversity using various methods to calculate distance, and perform principle components analysis ploting the first two axes.
According to Pereira et al., (2018)(7), the best normalization method for metagenomic gene abundance (tested in TARA ocean samples) is CSS for large group sizes. For this reason, we will use this method to show beta-diversity. Using as a guide the steps described in (8), we will create a list of suitable distance methods, iterate through them, and display a combined plot. For a better visualization we are going to show the 95% confidence region with an ellipse.
# Generating the methods list and discarding those that are not included in adonis methods listdist_methods =unlist(distanceMethodList)dist_methods = dist_methods[c(-(1:4),-(20:47))]# Iterating through the list to save the plotplist =vector("list", length(dist_methods))names(plist) = dist_methodsfor( i in dist_methods ){# Calculate distance matrix iDist =distance(ps_CSS, method=i)# Calculate ordination iMDS =ordinate(ps_CSS, "MDS", distance=iDist)## Make plot# Don't carry over previous plot (if error, p will be blank) p =NULL# Create plot, store as temp variable, p p =plot_ordination(ps_CSS, iMDS, color="env_label")# Add title to each plot p = p +ggtitle(paste("MDS using distance method ", i, sep=""))# Save the graphic to file. plist[[i]] = p}# Create a combined plotdf =ldply(plist, function(x) x$data)names(df)[1] ="distance"p =ggplot(df, aes(Axis.1, Axis.2, color=env_label))p = p +geom_point(size=3, alpha=0.5)p = p +facet_wrap(~distance, scales="free")p = p +ggtitle("MDS on various distance metrics for TARA ocean dataset") +stat_ellipse(level=0.95, type="norm", geom="polygon", alpha=0, aes(color=env_label)) +theme_bw() +scale_color_manual(values=c("#0a5032", "#a1be1f")) +labs(color ="Environment") options(repr.plot.width=12, repr.plot.height=10)p
Separation of the groups observed in some of the ordination plots could be due to location effect, dispersion effect, or both. To test the location effect (null hypothesis of no difference in composition between groups) we use a permanova implemented in the vegan library’s adonis function. We will use the distance metric that best segregates the groups and determine whether the two groups of samples have different centroids.
metadata =data.frame(sample_data(ps_CSS))css_beta =distance(ps_CSS, method="mountford")adonis2(css_beta ~ env_label, data = metadata, perm=1e3)
A anova.cca: 3 × 5
Df
SumOfSqs
R2
F
Pr(>F)
<dbl>
<dbl>
<dbl>
<dbl>
<dbl>
env_label
1
0.4299351
0.02835938
1.897162
0.000999001
Residual
65
14.7303053
0.97164062
NA
NA
Total
66
15.1602404
1.00000000
NA
NA
Tip: P-value tells you whether or not this result was likely a result of chance (we use p<0.05 as significant).
Adonis assumes there is homogeneity of dispersion among groups. Let’s test this assumption, to check whether the differences detected by adonis are due to variation in dispersion of the data. The strategy is to run a betadisper (also from the vegan library) to calculate variances and evaluate if there’s a significant variation between groups through an anova. Betadisper is a sister function to adonis to study the differences in dispersion within the same geometric framework.
ANOVA’s p-value not significant (p>0.05) means that group dispersions are homogenous (acept the null hypothesis of no difference in dispersion between groups). The general conclusion of adonis and betadisper is that our groups of samples (Surface and Mesopelagic zone water) are homogeneous on group dispersions (compositions vary similarly within the group) while having significantly different compositions between groups (according to adonis p<0.05).
Tip: Distance methods available in distance phyloseq function are listed in the object distanceMethodList. You can check such distance methods in detailed in the vegdistdocumentation
Creating the SIAMCAT object. We first need to create a label to set which group will be used as control for comparisons. We selected arbitrary Surface water as a case group and Mesopelagic layer as a control.
# Creating a siamcat labelsc_label =create.label(meta=sample_data(psglom), label='env_label', case='Surface')# Creating the siamcat objectsiamcat_obj =siamcat(phyloseq=psglom, label=sc_label)
Label used as case:
Surface
Label used as control:
Mesopelagic
+ finished create.label.from.metadata in 0.007 s
+ starting validate.data
+++ checking overlap between labels and features
+ Keeping labels of 67 sample(s).
+++ checking sample number per class
+++ checking overlap between samples and metadata
+ finished validate.data in 0.03 s
Filtering low-abundant features by abundance (threshold=0.001) and prevalence (threshold=0.05).
Computing differentially abundant taxa on filtered object. The function computes for each species the significance using a non-parametric Wilcoxon test and different effect sizes for the association (e.g. AUC or fold change).
Tip: For significantly associated microbial features, the plot shows:
The abundances of the features across the two different classes (Surface layer vs. Mesopelagic zone water)
The significance of the enrichment calculated by a Wilcoxon test (after multiple hypothesis testing correction)
The generalized fold change of each feature
The prevalence shift between the two classes
Adding “auroc” to the panels list, you can also visualize the Area Under the Receiver Operating Characteristics Curve (AU-ROC) as non-parametric effect size measure.
Printing the taxonomic labels of differentially abudant OTUs.
# The results table of differentially abundant OTUs are stored in associations(siamcat_obj)diff_otus =as.list(rownames(associations(siamcat_obj)[associations(siamcat_obj)$p.adj <0.05, ]))# The taxonomic label per OTU is stored in tax_table(psglom)tax_table(psglom)[rownames(tax_table(psglom)) %in% diff_otus, ]
++ metadata variables:
env_label
++ are nested inside the label and have been removed from this analysis
Finished checking metadata for confounders, results plotted to: conf_check.pdf
Tip: The output file has been created in the current directory. You can open the PDF file in a new tab in this notebook by following this link. You will find the following plots:
First Plot: Conditional entropies for metadata variables. The conditional entropy check primarily serves to remove nonsensical variables from subsequent checks. Conditional entropy quantifies the unique information contained in one variable (row) respective to another (column). Identical variables and derived variables which share the exact same information will have a value of zero.
Second plots: glm regression coefficients + significance + AU-ROCs for the statistical test.
Third plots (Q-Q plot, boxplot and barplots): Original confounder check descriptive stat plots. The function produces plot for each meta-variable. When the distributions looks very similar for both controls and cases, it is unlikely that such variable would confound the analyses.
Fourth plots: Variance explained by label versus variance explained by metadata plots. These plots show the variance explained by the label in comparison with the variance explained by the metadata variable for each individual feature. Variables with many features in the upper left corner might be confounding the label associations.
4.3. Generation of a predictive model: Machine learning workflow
Prepare cross-validation. Cross validation is a technique to assess how well a ML model would generalize to external data by partionining the dataset into training and test sets. SIAMCAT greatly simplifies the set-up of cross-validation schemes, including stratification of samples or keeping samples inseperable based on metadata. Here, we split the dataset into 10 parts and then train a model on 9 of these parts and use the left-out part to test the model. The whole process is repeated 20 times.
Features splitted for cross-validation successfully.
Model training. The actual model training is performed using the function train.model. Again, multiple options for customization are available, ranging from the machine learning method to the measure for model selection or customizable parameter set for hyperparameter tuning. Here, we train a Lasso model (10). This step takes a couple of minutes to complete – set verbose = T if you want to see the progress.
Make predictions. This function will automatically apply the models trained in cross validation to their respective test sets and aggregate the predictions across the whole data set.
Model evaluation and plot. In the final part, we want to find out how well the model performed and which microbial markers had been selected in the model. In order to do so, we first calculate how well the predictions fit the real data using the function evaluate.predictions. The function produces two plots for model evaluation. The first plot shows the Receiver Operating Characteristic (ROC)-curves, the other the Precision-recall (PR)-curves for the different cross-validation repetitions.
Interpretation plot. After statistical models have been trained to distinguish Surface layer from Mesopelagic zone water, we will plot characteristics of the models (i.e. model coefficients or feature importance) alongside the input data aiding in understanding how/why the model works (or not).
Successfully plotted model interpretation plot to: interpretation_plot.pdf
Tip: The output file has been created in the current directory, you can explore the content of the PDF file following this link. The plots shows:
The median relative feature weight for selected features (barplot on the left)
The robustness of features (i.e. in how many of the models the specific feature has been selected)
The distribution of selected features across samples (central heatmap)
Which proportion of the weight of all different models are shown in the plot (boxplot on the right)
Distribution of metadata across samples (heatmap below)
### References:
1. Lynch, M., Neufeld, J. Ecology and exploration of the rare biosphere. Nat Rev Microbiol 13:217–229 (2015) DOI:10.1038/nrmicro3400
2. McKnight, DT., Huerlimann, R., Bower, DS. et al. Methods for Normalizing Microbiome Data: An Ecological Perspective. Methods in Ecology and Evolution / British Ecological Society 10(3):389–400 (2019) DOI:10.1111/2041-210X.13115
3. McMurdie, PJ., Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10(4):e1003531 (2014) DOI:10.1371/journal.pcbi.1003531
4. Willis, A. Rarefaction, Alpha Diversity, and Statistics. Front. Microbiol. Sec. Terrestrial Microbiology 10:2407 (2019) DOI:10.3389/fmicb.2019.02407
5. Kim, BR., Shin, J., Guevarra, R. et al. Deciphering Diversity Indices for a Better Understanding of Microbial Communities. J Microbiol Biotechnol 28;27(12):2089-2093 (2017) DOI:10.4014/jmb.1709.09027
6. Wagner, BD., Grunwald, GK., Zerbe, GO. et al. On the Use of Diversity Measures in Longitudinal Sequencing Studies of Microbial Communities. Front. Microbiol. 9:1037 (2018) DOI:10.3389/fmicb.2018.01037
7. Pereira, M., Wallroth, M., Jonsson, V. et al. Comparison of normalization methods for the analysis of metagenomic gene abundance data. BMC Genomics 19,274 (2018) DOI:10.1186/s12864-018-4637-6
8. McMurdie, P., Holmes, S. Tutorial of 2018: The distance function in phyloseq. Available at website
9. Wirbel, J., Zych, K., Essex, M. et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 22:93 (2021) DOI:10.1186/s13059-021-02306-1
10. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267-288 (1996) DOI:10.1111/j.2517-6161.1996.tb02080.x
Documentation and more MGnifyR code and exercises available on GitHub and on rdrr site