Website and portal

Sections of the MGnify website

_images/mgnify-home-page.png

Figure 1 Homepage of the MGnify website.

Browse Data

The Browse Data section of the MGnify website provides searchable listings of the core datasets.

See the following section for full details.

Submit or request data, and My Data

You can submit your own metagenomic data for analysis using our pipeline, or ask us to analyse an existing public dataset.

If you submit private data, once it is analysed you’ll find it by logging into the My Data section.

For more information, see the dataflow section.

API

The API is the basis of programmatic access to MGnify’s database.

For more information, see the API section.

Organisation and access of data on the MGnify website

The “Browse data” section of the MGnify website provides searchable listings of the core datasets.

The data types in MGnify are linked to those in the European Nucleotide Archive. The ENA’s documentation includes a listing of ENA data domains.

In MGnify, datasets are listed in tables (which can also be downloaded, or queried programmatically using the API). There is a detail page available for most data types, accessed by clicking it in the table (e.g. on its ID, accession, name, or a view button).
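As a rough illustration of the programmatic route mentioned above, the sketch below builds a listing URL and flattens a JSON:API-style response into one row per record. The base URL, the `studies` resource name, the `page_size` parameter and the response layout (`data`, `id`, `attributes`) are assumptions made for this example and should be checked against the API section.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the MGnify REST API; verify it in the API section.
API_BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"


def listing_url(resource, **params):
    """Build a listing URL, e.g. .../studies?page_size=5 (parameter names are assumptions)."""
    query = urlencode(sorted(params.items()))
    return f"{API_BASE}/{resource}" + (f"?{query}" if query else "")


def rows_from_payload(payload):
    """Flatten a JSON:API-style payload into one dict per record."""
    return [
        {"id": item["id"], **item.get("attributes", {})}
        for item in payload.get("data", [])
    ]


def fetch_listing(resource, **params):
    """Download one page of a listing (requires network access)."""
    with urlopen(listing_url(resource, **params)) as response:
        return rows_from_payload(json.load(response))
```

Under these assumptions, `fetch_listing("studies", page_size=5)` would return the first few Studies as dictionaries, roughly mirroring the rows of the corresponding table on the website.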

Super Studies

Super Studies are collections of Studies that together represent the output of major metagenomic research efforts or consortia. Clicking a Super Study’s title in the table reveals a listing of all of its associated Studies.

Studies

Studies within MGnify are directly related to Studies/Projects in ENA. To appear in MGnify, an ENA Study is submitted for analysis by the MGnify pipeline.

They represent a collection of data generated by a research project, including any Publications produced, any Samples collected and sequenced, and the MGnify analyses run on them.

MGnify Studies are accessioned with an MGYS number.

Samples

Samples within MGnify are pulled directly from ENA. They represent (the genetic sequencing of) a single real-world specimen from a particular biome. Samples have accessions assigned by the source database, e.g. prefixed ERS by ENA.

The sequencing runs, assemblies and analyses can be explored for each sample, by clicking the sample in the Browse Samples list. More details about the information available are provided throughout this documentation.

Publications

Publications within MGnify are linked to Europe PubMed Central (Europe PMC). They represent the literature output of metagenomic projects. In MGnify, Publications are associated with Studies. In the Publications list, click the PubMed ID (PMID) to visit the Europe PMC page for the publication, or click “View details” to explore the related MGnify data for it. This includes the MGnify Studies associated with the Publication, as well as additional metadata (see the metadata section below).

Genomes

Genomes within MGnify are metagenome-assembled genomes (MAGs) organised into biome-specific catalogues (in some cases alongside a small number of isolate genomes). There is a separate page of documentation for the Genomes resource.

Viewing metadata for MGnify Samples, Studies, Publications

Sample metadata from ENA

The detail page for a Sample in MGnify shows metadata sourced from ENA.

Biome and Location metadata are visualised, and other user-provided metadata are listed.

_images/sample-metadata.png

Figure 2 Metadata listings for a MGnify Sample. (1) = a list of ENA-provided metadata. (2) = the Biome is highlighted. (3) = the sample location is mapped.

Sample metadata from BioSamples

The Sample detail pages also show a link to an entry in the EBI BioSamples database, where further metadata may be found.

Sample metadata from the Elixir Contextual Data Clearing House

Some Samples have additional metadata from Elixir’s Contextual Data Clearing House.

These metadata take the form of curations: either adding additional sample metadata beyond that stored in the ENA submission, or sometimes correcting existing metadata.

Like the Sample metadata from ENA, they’re shown as a list of Key:Value pairs.

_images/metadata-cdch.png

Figure 3 Additional Sample metadata from the Contextual Data Clearing House.

Additional metadata from text-mining on Publications

MGnify presents additional metadata that may be relevant to a Study or Sample, using automated text-mining on Publications provided by Europe PMC. MGnify fetches any metagenomics-relevant annotations (provided by EMERALD). These annotations can surface additional metadata that were not attached to samples when they were deposited in ENA, but that were mentioned in the publications describing them.

These can be explored within MGnify on a Publication page, or in the list of Publications on a Study page.

_images/europe-pmc-annotations.png

Figure 4 Annotations for publications are made available within MGnify, in a drill-down format.

It is impossible to automatically and confidently determine which sample(s) a particular publication annotation may refer to. However, on the Sample pages within MGnify the existence of any potential additional metadata from associated studies’ publications is highlighted beneath the main metadata listing.

Most open-access publications listed on the MGnify website will have some annotations. The release cycle for annotating newly added publications is every 3 months.

Content of the ‘Associated runs’ table on project page

This table lists all samples and runs associated with a project, as well as the experiment type (Amplicon, Assembly, Metabarcoding, Metagenomic or Metatranscriptomic), sequencing instrument model and pipeline version for each individual run. In addition, the last field displays links to analysis results and download pages (the latter being represented by an icon).

Finding quality control information about runs on the MGnify website

Quality control (QC) analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Quality control’ tab found toward the top of any run page (see Figure 5 below).

_images/qc1.png

Figure 5 A ‘Quality control’ tab can be found towards the top of each run page.

Selecting this tab brings up a page containing four graphical representations: a count of reads/contigs remaining pre- and post-QC, a histogram of minimum, maximum and average sequence length (post-QC), the distribution of GC content, and the nucleotide distribution over the first 500 positions (post-QC). These are available to download via the ‘Download’ tab found toward the top of any run page (see Figure 13 below).

_images/qc_metag.png

Figure 6 Typical even nucleotide distribution expected for metagenome, metatranscriptome and assembly datasets. Note that the stretch of uneven distribution observed up to position 20 indicates that the sequencing adapters had not been completely removed from the submitted reads.

_images/qc_ndamplicon.png

Figure 7 Typical uneven nucleotide distribution expected for an amplicon dataset.
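To make the GC content and nucleotide distribution plots more concrete, the following minimal sketch shows one way such statistics could be computed from a set of sequences. It is an illustration only, not the MGnify pipeline's implementation; the function names and the treatment of ambiguous bases as `N` are assumptions made for this example.

```python
from collections import Counter


def gc_content(sequence):
    """Return the percentage of G and C bases in a sequence (0.0 if it is empty)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * sum(sequence.count(base) for base in "GC") / len(sequence)


def nucleotide_distribution(sequences, max_positions=500):
    """Per position, return the fraction of A, C, G, T and N among sequences reaching it."""
    counts = [Counter() for _ in range(max_positions)]
    for sequence in sequences:
        for position, base in enumerate(sequence.upper()[:max_positions]):
            # Anything other than A, C, G or T is counted as N (an assumption).
            counts[position][base if base in "ACGT" else "N"] += 1
    distribution = []
    for counter in counts:
        total = sum(counter.values())
        if total == 0:
            break  # no sequence is long enough to reach this position
        distribution.append({base: counter[base] / total for base in "ACGTN"})
    return distribution
```

With an even distribution (as in Figure 6), the four fractions stay roughly constant from one position to the next. In an amplicon dataset (as in Figure 7), conserved regions make individual positions strongly favour one base.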

Finding functional information about runs on the MGnify website

Functional analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Functional Analysis’ tab found toward the top of any run page (see Figure 8 below). Note that this tab will be greyed out for amplicon runs, which have no functional results.

_images/func1-amplicon.png

Figure 8 A ‘Functional analysis’ tab can be found towards the top of each run page. Selecting this tab brings up a page displaying sequence features (‘Predicted CDS’, ‘Contigs with predicted CDS’ and ‘Contigs with predicted rRNA’).

Below this first bar chart, there are 4 tabs with different types of functional annotation:

A

_images/func1-v5.png

B

_images/func2-v5.png

C

_images/func3-v5.png

D

_images/func4-v5.png

Figure 9 Functional analysis of metagenomics data, as shown on the MGnify website. A) InterPro match information for the predicted coding sequences in the run is shown. The number of InterPro matches is displayed graphically, and as a downloadable table with links to the corresponding InterPro entries. B) Predicted GO slim terms are displayed. Different graphical representations are available, and can be selected by clicking on the ‘Switch view’ icons. C) A table of Pfam matches for predicted coding sequences, with a bar graph showing the top 10 hits. D) A table of KEGG ortholog matches for predicted coding sequences, with a bar graph showing the top 10 hits.

Finding pathways/systems information about runs on the MGnify website

Pathway and system annotations of runs within projects on the MGnify website can be accessed by selecting the ‘Pathways/Systems’ tab found toward the top of any run page. Note that this tab will only be accessible for assembly analysis.

There are 3 types of pathway and system annotations:

A

_images/path1-v5.png

B

_images/path2-v5.png

C

_images/path3-v5.png

Figure 10 Annotation of potential pathways and high order system classification, as shown on the MGnify website. A) A table and bar graph of KEGG modules derived from KEGG orthologs, with pathway completeness values. B) An expandable list of present Genome Properties, grouped by top level systems, derived from InterProScan outputs. C) A table of antiSMASH hits with a bar graph showing the top 10 hits.

Viewing functional annotation per contig

This feature is available for assembly analysis only and can be found in the tab ‘Contig Viewer’.

A

_images/contig1-v5.png

B

_images/contig2-v5.png

Figure 11 Interactive contig viewer for localised visualisation of functional annotation per contig. A) The main page contains a table of contigs with annotations, length and coverage. Text search and tickboxes allow users to search for functional annotations by method. B) Hover over each coding sequence to see functional annotation with external links, and protein length for that region.

Finding taxonomic information about runs on the MGnify website

Taxonomic analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Taxonomic analysis’ tab found toward the top of any run page (see Figure 12 below).

_images/taxonomy.png

Figure 12 A ‘Taxonomic analysis’ tab can be found towards the top of each run page. Selecting this tab brings up a page displaying the taxonomic results displayed as an interactive Krona plot.

The taxonomic analysis results are displayed as a Krona plot. This feature allows users to explore the taxonomic results and to zoom in on a particular taxonomic level by double-clicking on it. The corresponding distribution charts are displayed on the right-hand side of the panel.

Alternative pie-, bar- and stacked-chart representations can be generated by clicking on the ‘Switch view’ icons located above the Krona plot; however, data are then presented at the phylum level for clarity.

Files available to download on the MGnify website

The full data sets used to generate the graphs, along with a host of additional data and intermediate files can be downloaded for further analysis by clicking the ‘Download’ tab, found towards the top of the page.

_images/download_1-v5.png

Figure 13 The Download tab is organised into sections: ‘Sequence data’, ‘Functional analysis’ (not available in the case of amplicon runs), ‘Pathways and Systems’ (available only for assemblies), ‘Taxonomic analysis SSU’, ‘Taxonomic analysis LSU’, ‘Taxonomic analysis ITS’ (available for amplicon only) and ‘non-coding RNAs’ (will only exist if any non coding RNAs are identified).

Some of the files, particularly the sequence files in FASTA format, can be large. To facilitate the download process, these files are compressed with GZIP and when too large to be easily transferable, chunked into a manageable size. If it is the case for your runs, please download all chunks, decompress them and concatenate them to reconstitute the full file. Ensure the chunks are concatenated in the order given on the download page, as headers will be in the first chunked file.
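The reconstitution step described above can be sketched as follows. This is a minimal example using only the Python standard library; the chunk file names in the usage note are hypothetical, and the order in which you pass them must match the order given on the download page.

```python
import gzip
import shutil


def reconstitute(chunk_paths, output_path):
    """Decompress each gzip chunk in the given order and append it to output_path."""
    with open(output_path, "wb") as output:
        for chunk_path in chunk_paths:
            with gzip.open(chunk_path, "rb") as chunk:
                # Stream the decompressed bytes so large files never sit fully in memory.
                shutil.copyfileobj(chunk, output)
    return output_path
```

For example, `reconstitute(["run_1.fasta.gz", "run_2.fasta.gz"], "run.fasta")` would write a single uncompressed FASTA file, with the content of the first chunk at the top.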

Description of fasta files available to download

  • Processed nucleotide reads OR Processed contigs: this file contains all reads/contigs having passed the quality control (QC) step.

  • Predicted CDS: this file contains the protein sequences of the predicted coding sequences (pCDS).

  • Predicted ORF: this file contains the nucleotide sequences of the pCDS.

Description of functional annotation files available to download

  • InterPro matches: A tab-delimited file containing 15 columns. They are fully described here

  • Pfam annotation: summary of Pfam annotations and their frequencies.

  • Complete GO annotation: summary of GO term annotations in 4 columns: GO terms (labelled GO:XXXXXXX), GO term description, GO category (biological process, molecular function or cellular component) and number of pCDS annotated with a GO term.

  • GO slim annotation file: this file is derived from the ‘Complete GO annotation file’ and has the same format.

  • DIAMOND annotation: a tab-delimited file containing 16 columns with UniRef IDs and taxonomic annotation of protein sequences.

  • KEGG orthologues annotation: summary of KEGG ortholog annotations and their frequencies.
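As an example of working with these summaries, the sketch below reads the four-column GO annotation layout described above and totals the pCDS counts per GO category. The comma delimiter and quoted fields are assumptions about the file encoding (the `delimiter` parameter can be changed), and the function names are illustrative only.

```python
import csv
from collections import defaultdict


def read_go_summary(lines, delimiter=","):
    """Parse rows of (GO term, description, category, pCDS count)."""
    records = []
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 4 or not row[0].startswith("GO:"):
            continue  # skip blank lines, headers or malformed rows
        term, description, category, count = row
        records.append((term, description, category, int(count)))
    return records


def counts_by_category(records):
    """Sum the pCDS counts for each GO category."""
    totals = defaultdict(int)
    for _term, _description, category, count in records:
        totals[category] += count
    return dict(totals)
```

`read_go_summary` accepts any iterable of text lines, so an open file handle can be passed directly. Because the GO slim annotation file has the same format, it can be read the same way.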

Description of pathway and system annotation files available to download

  • antiSMASH annotation: EMBL flatfile and GenBank formatted files with annotations per contig.

  • Genome Properties annotation: summary of genome properties and their frequency.

  • KEGG pathway annotation: summary of KEGG modules, pathway names and completeness.

Description of taxonomic assignment files available to download

  • Reads/Contigs encoding…: All reads predicted to encode LSU, SSU, ITS or any other non-coding RNAs (ncRNAs). LSU, SSU and ncRNAs are predicted with Infernal. ITS have the predicted LSU and SSU sequences masked.

  • MAPseq assignments: this file contains the output from mapseq - a taxonomic assignment where applicable for each input sequence.

  • OTUs, counts and taxonomic assignments (TSV): this file contains a taxonomic lineage column followed by the frequency of its annotation and the corresponding NCBI taxid (not available for UNITE). This file can be directly imported into MEGAN6 for visualisation and further analysis.

  • OTUs, counts and taxonomic assignments (HDF5/JSON): two files for each type of rRNA or ITS database. These contain the same taxonomic information as the TSV files, in JSON and HDF5 formats. The BIOM files are computer-readable. The HDF5 (Hierarchical Data Format) files can be imported into analysis and visualisation tools such as Matlab and R. A large number of commercial and freely available tools, such as MEGAN6, can consume the JavaScript Object Notation (JSON) format.
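The TSV file can also be summarised with a few lines of code. The sketch below totals frequencies per phylum, assuming the column order given above (lineage, frequency, taxid), a semicolon-separated lineage with rank prefixes such as `p__` for phylum, and `#` at the start of header lines. All three are assumptions about the file layout, so adjust them to match the file you download.

```python
from collections import Counter


def phylum_counts(lines, rank_prefix="p__"):
    """Sum frequencies per phylum from tab-separated (lineage, frequency, taxid) rows."""
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        lineage, frequency = fields[0], float(fields[1])
        # Take the first non-empty rank carrying the phylum prefix, if any.
        phylum = next(
            (rank[len(rank_prefix):] for rank in lineage.split(";")
             if rank.startswith(rank_prefix) and len(rank) > len(rank_prefix)),
            "unassigned",
        )
        totals[phylum] += frequency
    return dict(totals)
```

Lineages that stop above phylum level are grouped under `unassigned` rather than dropped, so the totals still add up to the sum of all frequencies in the file.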

Summary files

In addition to the output files for individual runs, described above, MGnify provides a number of summary files available via the ‘Analysis summary’ tab on the project page (Figure 14 below). They summarise the counts per feature across all runs of a study and therefore provide an easy way to identify patterns. The summary files are split between functional (not available for amplicon-only studies) and taxonomy sections.

_images/summary.png

Figure 14 The ‘Analysis summary’ tab is organised in 2 sections: ‘Functional analysis for the project’ and ‘Taxonomic analysis for the project’ (the former is not available in the case of amplicon runs).

Data discovery on MGnify portal

MGnify is the largest metagenomic resource of public datasets. In order to help users access the data present in the portal, MGnify offers a powerful search tool and a range of browsing options.

Text search

The Search tool is underpinned by EBI Search and is accessible via any MGnify page (Figure 15 below).

_images/search.png

Figure 15 The ‘Text search’ can be accessed using the button located on the MGnify banner. The search space can be restricted by free-text using the ‘Search’ field below the header of this page.

The search page contains 3 tabs allowing users to navigate between studies, samples and analysis search levels. In each tab, the left hand side panel provides a number of facets that can be used to restrict the search space.

  • at the study level, the search can be restricted by ‘biome’ and ‘centre name’. Selection of any of the facets will also impact the search at sample and analysis level. Search results can be downloaded as a tab-separated file.

  • at the sample level, in addition to ‘biome’, the choice of facets includes ‘temperature’, ‘depth’, ‘experiment type’, ‘sequencing method’, ‘location name’, ‘disease status’ and ‘phenotype’, when provided. Note that these metadata are provided by the data submitter and are not curated.

  • at the analysis level, users can restrict their searches according to ‘biome’, ‘temperature’, ‘depth’, ‘pipeline version’, ‘organism’, ‘experiment type’ as well as GO and InterPro terms.

Browsing options

The MGnify homepage ‘Search by’ and ‘Latest studies’ sections have several browsing options to easily navigate publicly available annotated data:

  • Links to all studies, samples, analyses or experiment types will redirect users to the search page, where more filtering criteria are available.

  • There is also an option to browse by selected biomes. A subset of biome images with public samples is shown on the homepage. The ‘Browse all biomes’ link will open an expanded list. Upon selection, a table giving the hierarchical lineage according to the GOLD database classification is provided, with the number of projects associated with each lineage.

  • Any link in the ‘Latest studies’ section will redirect the user to the selected public project and all its available samples, runs and analyses.

The ‘Browse data’ tab allows users to search by super-studies, studies, samples or publications. Each search option has a text-based or biome filter. ‘Download results’ will return a CSV file of the search summary.

Private area

If you have given the MGnify team consent to analyse data for which you have requested a pre-publication confidential hold, you can access the analysis results for those datasets in your private login area. You can reach this area by clicking the ‘Login’ button at the top right-hand side of any page (see Figure 16 below).

_images/how_to_login.png

Figure 16 A login dialog will open once you have clicked on the ‘Login’ button, which can be found on the right top corner of each page.

After you have successfully logged in, you will have direct access to all your privately (and publicly) submitted projects and samples. A list of your latest submissions (projects and samples) appears on the home page, and you also have access to all the projects you have submitted so far in the projects list view (Figure 17 below). On that page, the ‘My projects’ drop-down filter lists all your projects.

_images/my_projects_cu.png

Figure 17 Filter options on the projects list view.