Analysis pipeline v5.0

Description of the latest MGnify analysis pipeline and the tools it uses.
Author
Affiliation
Published

October 3, 2024

Overview

The latest MGnify analysis service (version 5.0) offers specialised workflows for three different data types: amplicon, raw metagenomic/metatranscriptomic reads, and assembly. Each workflow is defined in common workflow language (CWL). (MGnify v5.0 CWL repository) All databases are available from an FTP link

The software and databases used for the various processing steps and analyses are listed in the following table.

Software, Databases and Versions used by MGnify:

Tool/Database Version Purpose Amplicon Raw reads Assemblies
SeqPrep 1.2 Merging paired end reads Yes Yes
Trimmomatic 0.36 Quality controlled Yes Yes
Biopython 1.74 Quality controlled Yes Yes Yes
bedtools 2.28.0 SSU/LSU rRNA masking for ITS Yes
pre-assembled 0.45h Sequence extraction Yes Yes Yes
Infernal 1.1.2 RNA predictions Yes Yes Yes
Rfam 13.0 Identification of SSU/LSU rRNA and other ncRNAs Yes Yes Yes
MAPseq 1.2.3 Taxonomic assignment of SSU/LSU rRNA and ITS Yes Yes Yes
Kronatools 2.7.1 Visualisation of taxonomic analyses Yes Yes Yes
biom-format 2.1.6 Formatting of taxonomic analyses Yes Yes Yes
mOTUs2 2.5.1 Phylogenetic marker gene based taxonomic profiling Yes
FragGeneScan 1.20 Protein coding sequence prediction Yes Yes
Prodigal 2.6.3 Protein coding sequence prediction Yes
InterProScan 75.0 Protein function annotation with separate Pfam results Yes Yes
GO terms in-house scripts N/A Assign gene ontology terms Yes Yes
eggNOG-mapper 4.5.1 Protein function annotation Yes
eggNOG-mapper 1.0.3 Protein function annotation Yes
HMMER 3.2.1 KEGG Ortholog prediction Yes Yes
KOfam - a modified version based on KEGG 90.0 2019-04-06 KEGG Ortholog prediction Yes Yes
KEGG and in-house descriptions 90.0 KEGG pathway predictions Yes
Genome Properties 2.0.1 Systems and pathways annotation Yes
antiSMASH 4.2.0 Secondary metabolite biosynthetic gene cluster annotation Yes
DIAMOND 0.9.25.126 Protein sequence-based taxonomic analysis Yes
SILVA release 132 SSU/LSU rRNA taxonomic database Yes Yes Yes
ITSoneDB 1.138 ITS1 taxonomic database Yes
UNITE 8.0 ITS taxonomic database Yes
UniRef90 2019_11 Protein sequence-based taxonomic analysis Yes
metaSPAdes 3.13 Assembly of raw reads (available on request) N/A N/A N/A

Amplicon analysis pipeline

Amplicon reads are merged with SeqPrep (where appropriate) and filtered with Trimmomatic to trim sequence regions with an average Phred 33 quality score of less than 15 in a sliding window of 4 base pairs. This is followed by removal of reads less than 100bp in length. An additional Biopython filtering step removes reads with more than 10% ambiguous bases. Infernal (running in hmm-only mode) using a library of ribosomal RNA hidden Markov models from Rfam is run to identify large and small subunit ribosomal ribonucleic acid (LSU and SSU rRNA) genes, using families found in the following clans: CL00111 (SSU) and CL00112 (LSU). Theses undergo taxonomic classification using the SILVA database in conjunction with MAPSeq which offers fast and accurate classification of reads, and provides corresponding confidence scores for assignment at each taxonomic level.

MGnify can also provide analysis of ITS (internal transcribed spacer) amplicons. ITS1 and ITS2 reside between the LSU and SSU genes and can be targeted for accurate classification of eukaryotic organisms. ITS taxonomy is assigned by MAPseq using two reference databases: ITSoneDB containing ITS1 sequences and UNITE containing ITS1 and ITS2 sequences. The SSU and LSU regions are masked using Rfam, as described above, prior to ITS classification, minimising cross reactivity.

Figure 1: Overview of the main steps in the amplicon workflow.

Raw reads analysis pipeline

Metagenomic and metatranscriptomic raw reads undergo merging, quality control and SSU/LSU based taxonomic analysis, as described for the amplicon pipeline above. Additional non-coding RNAs (ncRNAs) are identified with Infernal, using families from the following Rfam clans: CL00001 (tRNA), CL00002 (RNAse) and CL00003 (SRP). Supplementary phylogenetic classification based on marker gene profiling, is performed using mOTUs2 on the quality controlled reads.

For functional analysis, the sequence regions encoding rRNAs are masked, and FragGeneScan is used to predict coding sequences (pCDS). Coding sequences are assigned protein annotations with InterProScan, using 5 member databases that are able to process large numbers of potentially fragmented sequences (Gene3D, TIGRFAMs, Pfam, PRINTS and PROSITE patterns). Pfam annotations are provided as separate visualisations and downloads. GO terms are extracted from the InterProScan results and grouped according to category (Biological Process, Molecular Function and Cellular Component). GO terms are also summarized using a specialized GO Slim developed for metagenomic data. Finally, protein coding sequences undergo KEGG ortholog annotations using HMMER v3.2.1 and a modified version of KOfam 2019-04-06 (based on KEGG 90.0).

Figure 2: Overview of the main steps in the raw reads workflow.

Assembly analysis pipeline

Users can request assembly of their own raw sequencing reads, or publicly available datasets, using the ‘Request analysis’ section of the MGnify home page. Users own raw reads (with host sequences removed) must be archived in ENA before submitting an assembly request. The sequences then undergo quality control, as well as a precautionary additional host contamination removal process (where applicable) with bwa-mem. metaSPAdes is used for assembly of paired end reads and SPAdes for single reads. Alternatively, pre-assembled datasets, including those produced using other assembly algorithms, can be analysed. Quality control for assemblies is based on sequence length, with contigs less than 500 nucleotides removed from the analysis process.

rRNAs are identified and undergo taxonomic analysis as for raw reads above. Sequence regions encoding rRNAs are masked and protein coding sequences are predicted using a combined gene caller that utilises both Prodigal and FragGeneScan. In addition to rRNA-based taxonomic analyses, DIAMOND is used to assign taxonomy to protein sequences, based on the top hit to the UniRef90 database.

Protein function is assigned in the form of InterProScan annotations, GO terms, and KEGG ortholog predictions, as described for the raw reads analysis pipeline above. Additionally, clusters of orthologous groups (COGs) annotations and eggNOG functional descriptions are provided by the eggNOG-mapper tool.

KEGG ortholog annotations are further processed to produce KEGG pathway information, including module presence and completeness. Similarly, InterPro annotations for individual protein sequences are amalgamated to generate Genome Properties (GP), providing inference of higher level pathways and systems that may be present in the dataset. Finally, antiSMASH is used to identify and annotate biosynthetic gene clusters that code for the production of secondary metabolites.

Figure 3: Overview of the main steps in the assembly workflow.

Citation

BibTeX citation:
@online{2024,
  author = {, MGnify},
  title = {Analysis Pipeline V5.0},
  pages = {undefined},
  date = {2024-10-03},
  url = {https://docs.mgnify.org/src/docs/analysis.html},
  langid = {en}
}
For attribution, please cite this work as:
MGnify. 2024. “Analysis Pipeline V5.0.” October 3, 2024. https://docs.mgnify.org/src/docs/analysis.html.