Comparative Metagenomics


Alejandra Escobar

MGnify team at EMBL-EBI

This is a static preview

You can run and edit these examples interactively on Galaxy

Comparative metagenomics

Normalization methods, alpha & beta diversity, and differentially abundant taxa

In this notebook we aim to demonstrate how the MGnifyR tool can be used to fetch data and metadata of a MGnify metagenomic analyisis. Then we show how to generate diversity metrics for comparative metagenomics using taxonomic profiles. Finally, we will use SIAMCAT to detect differentially abundant taxa and to build a ML classification model.

MGnifyR is a library that provides a set of tools for easily accessing and processing MGnify data in R, making queries to MGnify databases through the MGnify API. The benefit of MGnifyR is that data can either be fetched in tsv format or be directly combined in a phyloseq object to run an analysis in a custom workflow.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter


# Loading libraries:
display_markdown(file = '../_resources/')
# Setting tables and figures size to display (these will be reset later):
options(repr.matrix.max.cols=150, repr.matrix.max.rows=200)
options(repr.plot.width=4, repr.plot.height=4)

Part 1. Fetching data and preprocessing

In this example we are going to fetch MGnify analysis results and metadata for TARA ocean metagenomic study corresponding to size fractions for prokaryotes (MGYS00002008). Find more information about the TARA Ocean Project.

1.1. Fetch the MGnify Analyses accession

  1. The first step is to retrieve the analysis accession list.
# Create your session mgnify_client object
mg = mgnify_client(usecache = T, cache_dir = '/home/jovyan/.mgnify_cache')
tara_all = mgnify_analyses_from_studies(mg, 'MGYS00002008')
  1. Use the list of accessions to fetch the metadata for all of the analyses from the MGnify API.
metadata = mgnify_get_analyses_metadata(mg, tara_all)

Tip: In case you are interested in running the comparative metagenomic analysis using data from different studies in MGnify, you can adapt the following commands:

analyses_accessions = mgnify_analyses_from_studies(mg, c("MGYS1","MGYS2"))

metadata = mgnify_get_analyses_metadata(mg, analyses_accessions)

1.2. Explore and filter samples by metadata

We want to keep only metagenomic samples (not amplicon) of ‘surface’ and ‘mesopelagic zone’ to compare. We also want to filter out results generated with old pipeline versions (<v5.0). In the following steps we will filter out other samples before exporting to the phyloseq object. Let’s first explore the metadata we fetched:

  1. Check the number of analyses in the study.
  1. Check the analysis_experiment-type to determine whether a filtering is necesary to discard amplicon samples.
  1. Keep results generated only with the most updated pipeline (v5.0).
v5_metadata = metadata[which(metadata$'analysis_pipeline-version'=='5.0'), ]
  1. Check the sample_environment-feature to discover what kind of samples are part of the study and how many of each exists.

Note: For a comparative study, we need at least 5 samples per group

  1. Filtering out noisy samples.

We want to create a new dataframe containing the relevant samples. In this step, we will also create a clean label for the environment feature. Additionally, we noticed that some samples have extreme non-sense values in the metadata. We will filter out those samples to reduce the noise on the analysis.

Note: To speed up the following analysis, we are going to keep only 34 samples per group (by randomly subsampling the table)

# Saving the list of Surface water samples in a dataframe
sub1 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "surface"), ]

# Reducing the metadata table to keep relevant fields only
variables_to_keep = c('sample_temperature','sample_depth','sample_salinity','sample_chlorophyll sensor','sample_nitrate sensor','sample_oxygen sensor')
reduced_sub1 = sub1[variables_to_keep]

# Removing rows with extreme values
reduced_sub1[reduced_sub1 == "99999"] <- NA
reduced_sub1[reduced_sub1 == "99999.0"] <- NA
reduced_sub1[reduced_sub1 == "0"] <- NA


# Subsampling to 34
random_1 = clean1[sample(nrow(clean1), 34), ]
random_1$'env_label'=c(rep('Surface', times=length(rownames(random_1))))
# Saving the list of Mesopelagic zone samples in a dataframe
sub2 = v5_metadata[str_detect(v5_metadata$'sample_environment-feature', "mesopelagic"), ]

# Reducing the metadata table to keep relevant fields only
reduced_sub2 = sub2[variables_to_keep]

# Removing rows with extreme values
reduced_sub2[reduced_sub2 == "99999"] <- NA
reduced_sub2[reduced_sub2 == "99999.0"] <- NA
reduced_sub2[reduced_sub2 == "0"] <- NA


# Subsampling to 34
random_2 = clean2[sample(nrow(clean2), 34), ]
random_2$'env_label'=c(rep('Mesopelagic', times=length(rownames(random_2))))

1.3. Converting into phyloseq object

  1. Now that we have a new dataframe with samples from either surface or mesopelagic zone water, we are going to create the phyloseq object.
ps = mgnify_get_analyses_phyloseq(mg, clean_acc)
  1. Keep only relevant columns in the phyloseq metadata table and transform numeric variables from characters to numbers. Add the environment label as well.
# Keeping relevant metadata in the phyloseq object
variables_to_keep = c('sample_temperature','sample_depth','sample_salinity','sample_chlorophyll.sensor','sample_nitrate.sensor','sample_oxygen.sensor')
df = data.frame(sample_data(ps))[variables_to_keep]

# Transforming character to nummeric variables
df[] = lapply(df, function(x) as.numeric(as.character(x)))
sample_data(ps) = df

# Adding the env label                              
sample_data(ps)$'env_label' = clean_metadata$env_label

Part 2. Normalization, alpha diversity indices and taxonomic profiles visualization

2.1. Cleaning the OTUs matrix

  1. Remove samples with extremely low coverage – they aren’t informative and interfere with the normalization process. The first step is to detect outliers by plotting some histograms.
options(repr.plot.width=4, repr.plot.height=4)
hist(log10(sample_sums(ps)), breaks=50, main="Sample size distribution", xlab="Sample size (log10)", ylab="Frequency", col="#007c80")

We can see that samples with number of reads \(\leq 10 ^ {1.5}\) (i.e. \(\lesssim 32\)) seem to be outliers. Let’s filter out the outliers and plot a new histogram.

ps_good = subset_samples(ps, sample_sums(ps) > 32)
hist(log10(sample_sums(ps_good)), breaks=50, main="Sample size distribution", xlab="Sample size (log10)", ylab="Frequency", col="#007c80")
  1. Remove singletons. Singletons are OTUs of size one, meaning that only one read was assigned to that OTU. These very low-abundance OTUs could be biologically real (belonging to the rare biosphere (1)), or they could be false positives due to sequencing artefacts. Singletons observed in only one sample are more likely to be artefacts, and it is good practice to remove them from the OTUs counts table to avoid artificially over-estimating the OTUs richness. You can find more discussion about this in Robert Edgar’s blog.
ps_final = filter_taxa(ps_good, function(x) sum(x) > 1, prune=TRUE)
  1. Show some stats on the sequencing depth across samples.
max_difference = max(sample_sums(ps_final))/min(sample_sums(ps_final))

sprintf("The max difference in sequencing depth is %s", max_difference)

options(repr.plot.width=4, repr.plot.height=5)

boxplot(sample_sums(ps_final), main="Sequencing depth across samples", xlab="", ylab="Number of reads", col="#a6093d")
text(y=boxplot.stats(sample_sums(ps_final))$stats, labels=boxplot.stats(sample_sums(ps_final))$stats, x=1.25)

An approximately 10-fold difference in the library sizes means that we will need to apply a normalization method before continuing with the analysis. The most common normalization methods used in microbiome count data are proportions and rarefaction. However, other methods originally developed to normalize RNA-seq counts have been adapted to differential-abundance analysis in microbiome data. A discussion about how to choose the right normalization method is out of the scope of this material, but the topic has been covered in multiple forums and scientific publications. Depending on the downstream analysis we intend to perform, different methods might be appropriate. For instance, to compare groups of samples at community-level through beta-diversity, “…proportions and rarefying produced more accurate comparisons among communities and are the only methods that fully normalized read depths across samples. Additionally, upper quartile, cumulative sum scaling (CSS), edgeR-TMM, and DESeq-VS often masked differences among communities when common OTUs differed, and they produced false positives when rare OTUs differed” (2). On the other hand, for detection of differentially abundant species, “both proportions and rarefied counts result in a high rate of false positives in tests for species that are differentially abundant across sample classes” (3).

In the following examples we will show three popular ways of normalization: relative abundance, rarefaction and cummulative sum scaling.

2.2. Normalization by total sum scaling (TSS, relative abundance or proportions)

The simplest way to normalize the differences in sample size is to transform the OTU counts table into relative abundance by dividing by the number of total reads of each sample. This type of normalization is also referred to as relative abundance or proportions. We use this normalization to compare taxonomic profiles, while alpha diversity indices are computed on the non-normalized matrix. The reason to do so is that we need a matrix of integer numbers as input.

  1. Transform taxonomy raw-counts matrix into relative abundance.
relab_ps = transform_sample_counts(ps_final, function(x) x/sum(x))
  1. Agglomerate taxonomy at Class rank and keep only the most abundant classes (threshold=1%, i.e. 0.01). In microbial data, we expect to observe abundance distributions with a long ‘tail’ of low-abundance organisms which often comprise the large majority of species. For this reason, once the matrix has been transformed to relative abundance, we will transform the taxonomic profile at a high taxonomic rank (Class), agglomerating the counts first and using an abundance threshold of 1% to avoid displaying too many unreadable categories in the plot.
psglom = tax_glom(relab_ps, "Class")
top_tss_ps = filter_taxa(psglom, function(x) mean(x) > 0.01, TRUE)

2.3. Normalization by subsampling (rarefaction)

Rarefaction is an alternative to relative abundance normalization to obtain an adjusted OTUs count matrix. The method is based on a process of subsampling to the smallest library size in the data set. The algorithm randomly removes reads until the samples reach the same library size. Despite the apparent disadvantage of discarding information from the larger samples, rarefaction is quite popular in microbial ecology. The first step is to find the smallest sample size. We can use the number of observed OTUs in the original matrix to do so.

  1. Find the smallest sample size.
head(df[order( df[,1] ),],1)
  1. Rarefying to the smallest sample.
ps_rare = rarefy_even_depth(ps_final, sample.size=56, replace=FALSE, rngseed=123, verbose=FALSE)
  1. Aglomerate taxonomy at Class rank keeping the top 15 classes only.
psglom = tax_glom(ps_rare, "Class")
top15 = names(sort(taxa_sums(psglom), decreasing=TRUE)[1:15])
top15_rare_ps = prune_taxa(top15, psglom)

2.4. Normalization by cumulative sum scaling (CSS)

The third normalization method we are going to apply is CSS. To do so, we will use the implementation on the microbiomeMarker library. Cumulative sum scaling normalization calculates scaling factors as the cumulative sum of gene (or taxa) abundances up to a data-derived threshold. This method is based on the assumption that the count distributions in each sample are equivalent for low abundance genes up to a certain threshold. Only the segment of each sample’s count distribution that is relatively invariant across samples is scaled by CSS.

  1. Normalizing the OTU counts in the ps_final object.
ps_CSS = normalize(ps_final, method="CSS")
  1. Aglomerate taxonomy at Class rank and keep the top 15 classes only.
psglom = tax_glom(ps_CSS, "Class")
top15 = names(sort(taxa_sums(psglom), decreasing=TRUE)[1:15])
top15_css_ps = prune_taxa(top15, psglom)

2.5. Computing alpha diversity indices

Alpha diversity is a measure of species diversity in a particular area or an ecosystem. In terms of microbial ecology, analyzing the sample diversity of sequencing data is commonly the first step in assessing differences between microbial environments. Some of the most popular alpha diversity indices reported in microbial community analyses are Chao, abundance-based coverage estimators (ACE), Simpson, and Shannon-Weaver.

Chao1 and ACE are nonparametric estimators that describe diversity as the max number of expected species in the sample (species richness). They consider the proportion of species that have been observed before (“recaptured”) to those that are observed only once. Both Chao1 and ACE underestimate true richness at low sample sizes.

On the other hand, Simpson and Shannon-Weaver, consider relative abundances and depends on both, species richness and the evenness (or equitableness), with which individuals are distributed among the different species. However, both metrics have specific biases. The Shannon-Weaver index places a greater weight on species richness, whereas the Simpson index considers evenness more than richness in its measurement.

More discussion and examples illustrating how sample size and normalization methods can affect alpha diversity metrics in references 4 and 5.

options(repr.plot.width=12, repr.plot.height=3)

### No normalization
plot_richness(ps_final, x="env_label", color="env_label", title="No normalization", measures=c("Observed", "Chao1", "ACE", "Shannon", "Simpson")) + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color = "Environment")

### Rarefaction
plot_richness(ps_rare, x="env_label", color="env_label", title="Rarefaction", measures=c("Observed", "Chao1", "ACE", "Shannon", "Simpson")) + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color = "Environment")

### Cumulative sum scaling
plot_richness(ps_CSS, x="env_label", color="env_label", title="Cumulative sum scaling", measures=c("Observed", "Chao1", "ACE", "Shannon", "Simpson")) + 
    geom_boxplot() + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(x='', color="Environment")

2.6. Community profile visualization

Visualise the taxonomic profile in barplots at Class rank in two modes.

options(repr.plot.width=10, repr.plot.height=4)

### Proportions
plot_bar(top_tss_ps, fill = "Class", title="Proportions") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

### Rarefaction
plot_bar(top15_rare_ps, fill="Class", title="Rarefaction") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

### Cumulative sum scaling
plot_bar(top15_css_ps, fill="Class", title="Cumulative sum scaling") + 
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

By naked eye, we can see that the profiles looks quite different. Let’s take a deeper look at some interesting Classes to check how the abundance change depending on the normalization method.

options(repr.plot.width=12, repr.plot.height=2)

### Proportions
plot_bar(subset_taxa(top_tss_ps, Class=='Alphaproteobacteria' | Class=='Dinophyceae' | Class=='Thermoplasmata' ), fill = "Class", title="Proportions") + 
    facet_wrap(~Class) + 
    scale_fill_manual(values=c("#18974c","#d41645","#f49e17")) +
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

### Rarefaction
plot_bar(subset_taxa(top15_rare_ps, Class=='Alphaproteobacteria' | Class=='Dinophyceae' | Class=='Thermoplasmata' ), fill="Class", title="Rarefaction") + 
    facet_wrap(~Class) + 
    scale_fill_manual(values=c("#18974c","#d41645","#f49e17")) +
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

### Cumulative sum scaling
plot_bar(subset_taxa(top15_css_ps, Class=='Alphaproteobacteria' | Class=='Dinophyceae' | Class=='Thermoplasmata' ), fill="Class", title="Cumulative sum scaling") + 
    facet_wrap(~Class) + 
    scale_fill_manual(values=c("#18974c","#d41645","#f49e17")) +
    theme(axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.background=element_rect(fill=NA), panel.grid.major=element_line(colour="#ebebeb")) + 

Part 3. Comparative metagenomics at community-level: Beta diversity

Beta (β)-diversity is used in ecology to assess the heterogeneity of local communities. It is defined as the ratio between regional and local species diversity. In simpler terms, it calculates the number of species that are not the same in two different environments.

Some of the most popular beta diversity measures in microbiome research include Bray-Curtis index (for compositional data), Jaccard index (for presence/absence data, ignoring abundance information), Aitchison distance (Euclidean distance for transformed abundances, aiming to avoid the compositionality bias), and the Unifrac distances (that use the phylogenetic tree information). Only some of the commonly used beta diversity measures are actual distances, therefore, the term dissimilarity or beta-diversity is commonly used. The workflow to assess β-diversity between groups of samples starts with calculating the diversity indices for each pair of samples. This results in a distance (or dissimilarity) matrix that is often used for ordination (e.g., principal coordinates analysis) and data exploration in microbiota data analysis (6).

For statistical analysis, adonis and betadisper methods are widely used together in β-diversity analyses.

Adonis tests whether two or more groups have different compositions. Adonis calculates the squared deviations of each site to the centroid and then, performs significance tests using F-tests based on sequential sums of squares from permutations of the raw data. It can be seen as an ANOVA using distance matrices (analogous to MANOVA). A non significant p-value (p>0.05), means that there’s no difference in composition between groups (null hypothesis of no difference in composition between groups).

Betadisper tests if two or more groups are homogeneously dispersed in relation to their species in studied samples. This test can be done to see if one group has more compositional variance than another. The method use the distance matrix to calculate the multivariate dispersions (variances; average distance to centroids). Then we use group dispersions to perform an ANOVA test. ANOVA’s p-value not significant (p>0.05) means that group dispersions are homogenous (null hypothesis of no difference in dispersion between groups).

  1. Compute beta diversity using various methods to calculate distance, and perform principle components analysis ploting the first two axes.

According to Pereira et al., (2018)(7), the best normalization method for metagenomic gene abundance (tested in TARA ocean samples) is CSS for large group sizes. For this reason, we will use this method to show beta-diversity. Using as a guide the steps described in (8), we will create a list of suitable distance methods, iterate through them, and display a combined plot. For a better visualization we are going to show the 95% confidence region with an ellipse.

# Generating the methods list and discarding those that are not included in adonis methods list
dist_methods = unlist(distanceMethodList)
dist_methods = dist_methods[c(-(1:4),-(20:47))]

# Iterating through the list to save the plot
plist = vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
    # Calculate distance matrix
    iDist = distance(ps_CSS, method=i)
    # Calculate ordination
    iMDS  = ordinate(ps_CSS, "MDS", distance=iDist)
    ## Make plot
    # Don't carry over previous plot (if error, p will be blank)
    p = NULL
    # Create plot, store as temp variable, p
    p = plot_ordination(ps_CSS, iMDS, color="env_label")
    # Add title to each plot
    p = p + ggtitle(paste("MDS using distance method ", i, sep=""))
    # Save the graphic to file.
    plist[[i]] = p

# Create a combined plot
df = ldply(plist, function(x) x$data)
names(df)[1] = "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=env_label))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for TARA ocean dataset") + 
    stat_ellipse(level=0.95, type="norm", geom="polygon", alpha=0, aes(color=env_label)) + 
    theme_bw() + 
    scale_color_manual(values=c("#0a5032", "#a1be1f")) + 
    labs(color = "Environment") 

options(repr.plot.width=12, repr.plot.height=10)
  1. Separation of the groups observed in some of the ordination plots could be due to location effect, dispersion effect, or both. To test the location effect (null hypothesis of no difference in composition between groups) we use a permanova implemented in the vegan library’s adonis function. We will use the distance metric that best segregates the groups and determine whether the two groups of samples have different centroids.
metadata = data.frame(sample_data(ps_CSS))
css_beta = distance(ps_CSS, method="mountford")
adonis2(css_beta ~ env_label, data = metadata, perm=1e3)

Tip: P-value tells you whether or not this result was likely a result of chance (we use p<0.05 as significant).

  1. Adonis assumes there is homogeneity of dispersion among groups. Let’s test this assumption, to check whether the differences detected by adonis are due to variation in dispersion of the data. The strategy is to run a betadisper (also from the vegan library) to calculate variances and evaluate if there’s a significant variation between groups through an anova. Betadisper is a sister function to adonis to study the differences in dispersion within the same geometric framework.
bd = betadisper(css_beta, metadata$'env_label')

ANOVA’s p-value not significant (p>0.05) means that group dispersions are homogenous (acept the null hypothesis of no difference in dispersion between groups). The general conclusion of adonis and betadisper is that our groups of samples (Surface and Mesopelagic zone water) are homogeneous on group dispersions (compositions vary similarly within the group) while having significantly different compositions between groups (according to adonis p<0.05).

Tip: Distance methods available in distance phyloseq function are listed in the object distanceMethodList. You can check such distance methods in detailed in the vegdist documentation

Part 4. Detection of differentially abundant taxa

There are many approaches out there to detect differentially abundant taxa (or genes). Here we are using the method implemented in SIAMCAT library. SIAMCAT is an R package that puts together statistical functions to compare large-scale studies in order to detect microbial community composition changes due to environmental factors. Describing such associations in quantitative models has multiple applications, such as in predicting disease status based on microbiome data. In addition, the pipeline includes the generation of high-quality plots for visual inspection of results (9). There are multiple parameters that can be adjusted for better performance in each step of the analysis. However, in this exercise we are using default arguments with minimal tuning.

4.1. Association testing

  1. Preparing the data. The input for SIAMCAT is a matrix with normalized data as relative abundances. We will use the phyloseq object generated in step 2.2.2 of this notebook and aggregate the data at genus level.
psglom = tax_glom(relab_ps, "Genus")
  1. Creating the SIAMCAT object. We first need to create a label to set which group will be used as control for comparisons. We selected arbitrary Surface water as a case group and Mesopelagic layer as a control.
# Creating a siamcat label
sc_label = create.label(meta=sample_data(psglom), label='env_label', case='Surface')

# Creating the siamcat object
siamcat_obj = siamcat(phyloseq=psglom, label=sc_label)
  1. Filtering low-abundant features by abundance (threshold=0.001) and prevalence (threshold=0.05).
siamcat_obj = filter.features(siamcat_obj, filter.method='abundance', cutoff=1e-03)
siamcat_obj = filter.features(siamcat_obj, filter.method='prevalence', cutoff=0.05, feature.type='filtered')
  1. Computing differentially abundant taxa on filtered object. The function computes for each species the significance using a non-parametric Wilcoxon test and different effect sizes for the association (e.g. AUC or fold change).
options(repr.plot.width=12, repr.plot.height=7)
siamcat_obj = check.associations(siamcat_obj)
association.plot(siamcat_obj, panels=c("fc", "prevalence"), prompt = FALSE, verbose = 0)
Tip: For significantly associated microbial features, the plot shows:
  • The abundances of the features across the two different classes (Surface layer vs. Mesopelagic zone water)
  • The significance of the enrichment calculated by a Wilcoxon test (after multiple hypothesis testing correction)
  • The generalized fold change of each feature
  • The prevalence shift between the two classes

Adding “auroc” to the panels list, you can also visualize the Area Under the Receiver Operating Characteristics Curve (AU-ROC) as non-parametric effect size measure.

  1. Printing the taxonomic labels of differentially abudant OTUs.
# The results table of differentially abundant OTUs are stored in associations(siamcat_obj)
diff_otus = as.list(rownames(associations(siamcat_obj)[associations(siamcat_obj)$p.adj < 0.05, ]))

# The taxonomic label per OTU is stored in tax_table(psglom)
tax_table(psglom)[rownames(tax_table(psglom)) %in% diff_otus, ]

4.2. Cofounder testing

  1. Test the associated metadata variables for potential confounding influence. As many biological and technical factors beyond the primary phenotype of interest can influence microbiome composition, simple association studies may suffer confounding by other variables, which can lead to spurious results. Associations are visualized either as barplot or Q-Q plot, depending on the type of metadata. Additionally, it evaluates associations among metadata variables using conditional entropy and associations with the label using generalized linear models, producing a correlation heatmap and appropriate quantitative barplots, respectively.
suppressWarnings(check.confounders(siamcat_obj, fn.plot='conf_check.pdf'))
Tip: The output file has been created in the current directory. You can open the PDF file in a new tab in this notebook by following this link. You will find the following plots:
  • First Plot: Conditional entropies for metadata variables. The conditional entropy check primarily serves to remove nonsensical variables from subsequent checks. Conditional entropy quantifies the unique information contained in one variable (row) respective to another (column). Identical variables and derived variables which share the exact same information will have a value of zero.
  • Second plots: glm regression coefficients + significance + AU-ROCs for the statistical test.
  • Third plots (Q-Q plot, boxplot and barplots): Original confounder check descriptive stat plots. The function produces plot for each meta-variable. When the distributions looks very similar for both controls and cases, it is unlikely that such variable would confound the analyses.
  • Fourth plots: Variance explained by label versus variance explained by metadata plots. These plots show the variance explained by the label in comparison with the variance explained by the metadata variable for each individual feature. Variables with many features in the upper left corner might be confounding the label associations.

4.3. Generation of a predictive model: Machine learning workflow

Another feature of SIAMCAT is a versatile but easy-to-use interface for the construction of machine learning meta-analysis models on the basis of microbial markers. The steps to follow include functions for data normalization, splitting the data into cross-validation folds, training the model, and making predictions based on cross-validation instances and the trained models. Such a model can be built on one dataset and then be applied on another, similarly processed holdout dataset. This might be of interest when comparing data from two different studies testing the same condition.

  1. Data normalisation. SIAMCAT offers a few normalization approaches that can be useful for subsequent statistical modeling in the sense that they transform features in a way that can increase the accuracy of the resulting models. Importantly, these normalization techniques do not make use of any label information, and can thus be applied up front to the whole data set and outside of the following cross validation.
siamcat_obj = normalize.features(
    norm.param=list(log.n0=1e-06, sd.min.q=0)
  1. Prepare cross-validation. Cross validation is a technique to assess how well a ML model would generalize to external data by partionining the dataset into training and test sets. SIAMCAT greatly simplifies the set-up of cross-validation schemes, including stratification of samples or keeping samples inseperable based on metadata. Here, we split the dataset into 10 parts and then train a model on 9 of these parts and use the left-out part to test the model. The whole process is repeated 20 times.
siamcat_obj =, num.folds=10, num.resample=10)
  1. Model training. The actual model training is performed using the function train.model. Again, multiple options for customization are available, ranging from the machine learning method to the measure for model selection or customizable parameter set for hyperparameter tuning. Here, we train a Lasso model (10). This step takes a couple of minutes to complete – set verbose = T if you want to see the progress.
siamcat_obj = train.model(siamcat_obj, method='lasso', verbose = F)
  1. Make predictions. This function will automatically apply the models trained in cross validation to their respective test sets and aggregate the predictions across the whole data set.
siamcat_obj = make.predictions(siamcat_obj, verbose = F)
  1. Model evaluation and plot. In the final part, we want to find out how well the model performed and which microbial markers had been selected in the model. In order to do so, we first calculate how well the predictions fit the real data using the function evaluate.predictions. The function produces two plots for model evaluation. The first plot shows the Receiver Operating Characteristic (ROC)-curves, the other the Precision-recall (PR)-curves for the different cross-validation repetitions.
siamcat_obj = evaluate.predictions(siamcat_obj)

options(repr.plot.width=5, repr.plot.height=5)
model.evaluation.plot(siamcat_obj, colours="#007c80")
  1. Interpretation plot. After statistical models have been trained to distinguish Surface layer from Mesopelagic zone water, we will plot characteristics of the models (i.e. model coefficients or feature importance) alongside the input data aiding in understanding how/why the model works (or not).
model.interpretation.plot(siamcat_obj, consens.thres = 0.5, fn.plot = 'interpretation_plot.pdf')
Tip: The output file has been created in the current directory, you can explore the content of the PDF file following this link. The plots shows:
  • The median relative feature weight for selected features (barplot on the left)
  • The robustness of features (i.e. in how many of the models the specific feature has been selected)
  • The distribution of selected features across samples (central heatmap)
  • Which proportion of the weight of all different models are shown in the plot (boxplot on the right)
  • Distribution of metadata across samples (heatmap below)

### References:

1. Lynch, M., Neufeld, J. Ecology and exploration of the rare biosphere. Nat Rev Microbiol 13:217–229 (2015) DOI:10.1038/nrmicro3400

2. McKnight, DT., Huerlimann, R., Bower, DS. et al. Methods for Normalizing Microbiome Data: An Ecological Perspective. Methods in Ecology and Evolution / British Ecological Society 10(3):389–400 (2019) DOI:10.1111/2041-210X.13115

3. McMurdie, PJ., Holmes, S. Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible. PLoS Comput Biol 10(4):e1003531 (2014) DOI:10.1371/journal.pcbi.1003531

4. Willis, A. Rarefaction, Alpha Diversity, and Statistics. Front. Microbiol. Sec. Terrestrial Microbiology 10:2407 (2019) DOI:10.3389/fmicb.2019.02407

5. Kim, BR., Shin, J., Guevarra, R. et al. Deciphering Diversity Indices for a Better Understanding of Microbial Communities. J Microbiol Biotechnol 28;27(12):2089-2093 (2017) DOI:10.4014/jmb.1709.09027

6. Wagner, BD., Grunwald, GK., Zerbe, GO. et al. On the Use of Diversity Measures in Longitudinal Sequencing Studies of Microbial Communities. Front. Microbiol. 9:1037 (2018) DOI:10.3389/fmicb.2018.01037

7. Pereira, M., Wallroth, M., Jonsson, V. et al. Comparison of normalization methods for the analysis of metagenomic gene abundance data. BMC Genomics 19,274 (2018) DOI:10.1186/s12864-018-4637-6

8. McMurdie, P., Holmes, S. Tutorial of 2018: The distance function in phyloseq. Available at website

9. Wirbel, J., Zych, K., Essex, M. et al. Microbiome meta-analysis and cross-disease comparison enabled by the SIAMCAT machine learning toolbox. Genome Biol 22:93 (2021) DOI:10.1186/s13059-021-02306-1

10. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1):267-288 (1996) DOI:10.1111/j.2517-6161.1996.tb02080.x

Documentation and more MGnifyR code and exercises available on GitHub and on rdrr site

Phyloseq tutorials available on GitHub

Visit the SIAMCAT documentation