Website and portal
Sections of the MGnify website
MGnify’s Studies, Samples, and Analyses are indexed so that they can be searched for by name, description, accession, biome and more. This is the quickest way to get started finding metagenomics-derived data.
The Browse Data section of the MGnify website provides searchable listings of the core datasets.
See the following section for full details.
You can submit your own metagenomic data for analysis using our pipeline, or ask us to analyse an existing public dataset.
If you submit private data, once it is analysed you’ll find it by logging into the My Data section.
For more information, see the dataflow section.
The API is the basis of programmatic access to MGnify’s database.
For more information, see the API section.
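As a rough illustration of programmatic access, the Python sketch below builds a listing URL and reads record identifiers out of a JSON:API-style response. The base URL, the `studies` resource name, the `page_size` parameter and the example payload (including its accession) are assumptions made for this sketch and are not taken from this page; the API section remains the reference for the real routes and fields.

```python
import json
from urllib.parse import urlencode

# Assumed base URL; confirm it against the API section before relying on it.
API_BASE = "https://www.ebi.ac.uk/metagenomics/api/v1"


def listing_url(resource, **params):
    """Build a listing URL such as <API_BASE>/studies?page_size=5."""
    url = f"{API_BASE}/{resource}"
    query = urlencode(sorted(params.items()))
    return f"{url}?{query}" if query else url


def record_ids(payload):
    """Return the 'id' of every record in a JSON:API-style document."""
    return [record["id"] for record in payload.get("data", [])]


# Hypothetical response body, shaped like a JSON:API listing.
example_body = '{"data": [{"type": "studies", "id": "MGYS00000001"}]}'

url = listing_url("studies", page_size=5)
ids = record_ids(json.loads(example_body))
```

The sketch deliberately makes no network request, so it can be adapted to whichever HTTP client you prefer.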
Organisation and access of data on the MGnify website
The “Browse data” section of the MGnify website provides searchable listings of the core datasets.
In MGnify, datasets are listed in tables (which can also be downloaded, or queried programmatically using the API). There is a detail page available for most data types, accessed by clicking it in the table (e.g. on its ID, accession, name, or a view button).
Super Studies are collections of Studies that together represent the output of major metagenomic research efforts or consortia. Clicking a Super Study’s title in the table reveals a listing of all of its associated Studies.
Studies within MGnify are directly related to Studies/Projects in ENA. To appear in MGnify, an ENA Study must be submitted for analysis by the MGnify pipeline.
They represent a collection of data generated by a research project, including any Publications produced, any Samples collected and sequenced, and the MGnify analyses run on them.
MGnify Studies are accessioned with an MGYS number.
Samples within MGnify are pulled directly from ENA. They represent (the genetic sequencing of) a single real-world specimen from a particular biome. Samples have accessions assigned by the source database, e.g. prefixed ERS by ENA.
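The accession prefixes mentioned above (MGYS for MGnify Studies, ERS for ENA-assigned Samples) can be used to sort mixed lists of identifiers. The following Python sketch shows one way to do this. The assumption that the prefix is followed only by digits is ours, and samples from other source databases may carry prefixes that are not covered here.

```python
import re

# Prefixes named in the text; the digits-only suffix is an assumption.
ACCESSION_PATTERNS = {
    "study": re.compile(r"MGYS\d+"),
    "sample": re.compile(r"ERS\d+"),
}


def classify_accession(accession):
    """Return 'study', 'sample', or None if the prefix is not recognised."""
    cleaned = accession.strip()
    for kind, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(cleaned):
            return kind
    return None
```

For example, `classify_accession("ERS123456")` returns `"sample"` (the accession itself is made up for illustration).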
The sequencing runs, assemblies and analyses can be explored for each sample, by clicking the sample in the Browse Samples list. More details about the information available are provided throughout this documentation.
Publications within MGnify are linked to Europe PubMed Central (Europe PMC). They represent the literature output of metagenomic projects. In MGnify, Publications are associated with Studies. In the Publications list, click the PubMed ID (PMID) to visit the Europe PMC page for the publication, or click “View details” to explore the related MGnify data for it. This includes the MGnify Studies associated with the Publication, as well as additional Metadata (see Metadata section below).
Genomes within MGnify are metagenomic-assembled genomes (MAGs) organised into biome-specific catalogues (in some cases alongside a small number of isolate genomes). There is a separate page of documentation for the Genomes resource.
Viewing metadata for MGnify Samples, Studies, Publications
Sample metadata from ENA
Biome and Location metadata are visualised, and other user-provided metadata are listed.
Sample metadata from BioSamples
Sample metadata from the Elixir Contextual Data Clearing House
These metadata take the form of curations: either adding additional sample metadata beyond that stored in the ENA submission, or sometimes correcting existing metadata.
Like the Sample metadata from ENA, they’re shown as a list of Key:Value pairs.
Additional metadata from text-mining on Publications
MGnify presents additional metadata that may be relevant to a Study or Sample using automated text-mining on Publications, provided by Europe PMC. MGnify fetches any Metagenomics-relevant annotations (provided by EMERALD). These annotations can surface additional metadata that were not attached to samples when they were deposited in ENA, but that were mentioned in the publications describing them.
These can be explored within MGnify on a Publication page, or in the list of Publications on a Study page.
It is impossible to automatically and confidently determine which sample(s) a particular publication annotation may refer to. However, on the Sample pages within MGnify the existence of any potential additional metadata from associated studies’ publications is highlighted beneath the main metadata listing.
Most open-access publications listed on the MGnify website will have some annotations. The release cycle for annotating newly added publications is every 3 months.
Content of the ‘Associated runs’ table on project page
This table lists all samples and runs associated with a project as well as the experiment type (Amplicon, Assembly, Metabarcoding, Metagenomic or Metatranscriptomic), sequencing instrument model and pipeline version for each individual run. In addition, the last field displays links to analysis results and download pages (the latter being represented by the icon).
Finding quality control information about runs on the MGnify website
Quality control (QC) analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Quality control’ tab found toward the top of any run page (see Figure 1 below).
Selecting this tab brings up a page containing four graphical representations: a count of reads/contigs remaining before and after QC, a histogram of minimum, maximum and average sequence length (post QC), and the distribution of GC content and of the first 500 nucleotides (post QC). These are available to download via the ‘Download’ tab found toward the top of any run page (see Figure 8 below).
Finding functional information about runs on the MGnify website
Functional analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Functional Analysis’ tab found toward the top of any run page (see Figure 4 below). Note that this tab will be greyed for amplicon runs that have no functional results.
Below this first bar chart, there are 4 tabs with different types of functional annotation:
Finding pathways/systems information about runs on the MGnify website
Pathway and system annotations of runs within projects on the MGnify website can be accessed by selecting the ‘Pathways/Systems’ tab found toward the top of any run page. Note that this tab will only be accessible for assembly analysis.
There are 3 types of pathway and system annotations:
Viewing functional annotation per contig
This feature is available for assembly analysis only and can be found in the tab ‘Contig Viewer’.
Finding taxonomic information about runs on the MGnify website
Taxonomic analysis of runs within projects on the MGnify website can be accessed by selecting the ‘Taxonomic analysis’ tab found toward the top of any run page (see Figure 7 below).
The taxonomic analysis results are displayed as a Krona plot. This feature allows users to explore the taxonomic results and to zoom in on a particular taxonomic level by double-clicking on it. The corresponding distribution charts are displayed on the right-hand side of the panel.
Alternative pie-, bar- and stacked-chart representations can be generated by clicking on the ‘Switch view’ icons located above the Krona plot. In these views, data are presented at the phylum level for clarity.
Files available to download on the MGnify website
The full data sets used to generate the graphs, along with a host of additional data and intermediate files can be downloaded for further analysis by clicking the ‘Download’ tab, found towards the top of the page.
Some of the files, particularly the sequence files in FASTA format, can be large. To facilitate the download process, these files are compressed with GZIP and, when too large to be easily transferable, split into chunks of a manageable size. If this is the case for your runs, please download all chunks, decompress them and concatenate them to reconstitute the full file. Ensure the chunks are concatenated in the order given on the download page, as headers will be in the first chunked file.
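The reassembly steps above (decompress each chunk, then concatenate in download-page order) can be sketched in Python as follows. This is a minimal example that assumes each chunk is an independent GZIP file. The function name and any file names you pass to it are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def join_chunks(chunk_paths, output_path):
    """Decompress GZIP chunks in the given order and append them to one file."""
    with open(output_path, "wb") as out:
        for chunk in chunk_paths:
            # The caller supplies the paths in the order listed on the
            # download page, so the chunk holding the headers comes first.
            with gzip.open(chunk, "rb") as part:
                shutil.copyfileobj(part, out)
    return Path(output_path)
```

A call might look like `join_chunks(["part_1.fasta.gz", "part_2.fasta.gz"], "full.fasta")`, where the file names stand in for whatever the download page lists for your run.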
Description of fasta files available to download
Processed nucleotide reads OR Processed contigs: this file contains all reads/contigs having passed the quality control (QC) step.
Predicted CDS: this file contains the protein sequences of the predicted coding sequences (pCDS).
Predicted ORF: this file contains the nucleotide sequences of the pCDS.
Description of functional annotation files available to download
InterPro matches: A tab-delimited file containing 15 columns. They are fully described here
Pfam annotation: summary of Pfam annotations and their frequencies.
Complete GO annotation: summary of GO term annotations in 4 columns: GO terms (labelled GO:XXXXXXX), GO term description, GO category (biological process, molecular function or cellular component) and number of pCDS annotated with a GO term.
GO slim annotation file: this file is derived from the ‘Complete GO annotation file’ and has the same format.
DIAMOND annotation: a tab-delimited file containing 16 columns with UniRef IDs and taxonomic annotation of protein sequences.
KEGG orthologues annotation: summary of KEGG ortholog annotations and their frequencies.
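As an example of working with these summaries, the Python sketch below totals the pCDS counts per GO category from a file in the four-column layout described for the Complete GO annotation. The comma delimiter, the quoting, and the category spellings in the example rows are assumptions. The counts are made up for illustration.

```python
import csv
import io
from collections import Counter


def counts_by_category(handle, delimiter=","):
    """Sum the pCDS counts (column 4) per GO category (column 3)."""
    totals = Counter()
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 4 or not row[0].startswith("GO:"):
            continue  # skip blank lines or a header row, if present
        totals[row[2]] += int(row[3])
    return totals


# Made-up rows in the four-column layout: term, description, category, count.
example = io.StringIO(
    '"GO:0003677","DNA binding","molecular_function","12"\n'
    '"GO:0005524","ATP binding","molecular_function","30"\n'
    '"GO:0006412","translation","biological_process","7"\n'
)
totals = counts_by_category(example)
```

Because the GO slim annotation file is described as having the same format, the same function should apply to it, subject to the same assumptions.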
Description of pathway and system annotation files available to download
antiSMASH annotation: EMBL flatfile and GenBank formatted files with annotations per contig.
Genome Properties annotation: summary of genome properties and their frequency.
KEGG pathway annotation: summary of KEGG modules, pathway names and completeness.
Description of taxonomic assignment files available to download
Reads/Contigs encoding…: All reads predicted to encode LSU, SSU, ITS or any other non-coding RNAs (ncRNAs). LSU, SSU and ncRNAs are predicted with Infernal. ITS have the predicted LSU and SSU sequences masked.
MAPseq assignments: this file contains the output from MAPseq, that is, a taxonomic assignment (where applicable) for each input sequence.
OTUs, counts and taxonomic assignments (TSV): this file contains a taxonomic lineage column followed by the frequency of its annotation and the corresponding NCBI taxid (not available for UNITE). This file can be directly imported into MEGAN6 for visualisation and further analysis.
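The OTU TSV can also be summarised outside the website, for instance to reproduce a phylum-level view. The Python sketch below assumes the column order described above (lineage, frequency, taxid), and further assumes that lineages are semicolon-separated with the phylum rank marked by a `p__` prefix. That lineage syntax is an assumption, not something stated on this page, so adjust `phylum_prefix` to match your files. The example counts are made up.

```python
import csv
import io
from collections import Counter


def phylum_counts(handle, phylum_prefix="p__"):
    """Total the frequencies per phylum from (lineage, count, taxid) rows."""
    totals = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            count = float(row[1])
        except ValueError:
            continue  # header or comment line without a numeric count
        phyla = [
            rank[len(phylum_prefix):]
            for rank in row[0].split(";")
            if rank.startswith(phylum_prefix)
        ]
        phylum = phyla[0] if phyla and phyla[0] else "unassigned"
        totals[phylum] += count
    return totals


# Made-up rows: lineage, frequency, NCBI taxid.
example = io.StringIO(
    "sk__Bacteria;k__;p__Firmicutes;c__Bacilli\t10\t91061\n"
    "sk__Bacteria;k__;p__Firmicutes\t5\t1239\n"
    "sk__Archaea\t2\t2157\n"
)
totals = phylum_counts(example)
```

Lineages that stop above the phylum rank are grouped under `"unassigned"` rather than dropped, so the totals still add up to the input counts.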
In addition to the output files for individual runs, described above, MGnify provides a number of summary files, available via the ‘Analysis summary’ tab on the project page (Figure 9 below). They summarise the counts per feature across all runs of a study and therefore provide an easy way to identify patterns. The summary files are split between functional (not available for amplicon-only studies) and taxonomy sections.
Data discovery on MGnify portal
MGnify is the largest metagenomic resource of public datasets. In order to help users access the data present in the portal, MGnify offers a powerful search tool and a range of browsing options.
The Search tool is underpinned by EBI Search and is accessible from any MGnify page (Figure 11 below).
The search page contains 3 tabs allowing users to navigate between studies, samples and analysis search levels. In each tab, the left hand side panel provides a number of facets that can be used to restrict the search space.
- At the study level, the search can be restricted by ‘biome’ and ‘centre name’. Selection of any of the facets will also impact the search at sample and analysis level. Search results can be downloaded as a tab-separated file.
- At the sample level, in addition to ‘biome’, the choice of facets includes ‘temperature’, ‘depth’, ‘experiment type’, ‘sequencing method’, ‘location name’, ‘disease status’ and ‘phenotype’, when provided. Note that these metadata are provided by the data submitter and are not curated.
- At the analysis level, users can restrict their searches according to ‘biome’, ‘temperature’, ‘depth’, ‘pipeline version’, ‘organism’, ‘experiment type’ as well as GO and InterPro terms.
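Once search results have been downloaded as a tab-separated file, the same kind of facet restriction can be repeated locally. The Python sketch below keeps only the rows that match every requested value. The column names (`accession`, `biome`, `centre_name`) and the example rows are hypothetical, so check the header of your own download before filtering.

```python
import csv
import io


def filter_rows(handle, **facets):
    """Keep rows whose columns equal every requested facet value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row
        for row in reader
        if all(row.get(column) == value for column, value in facets.items())
    ]


# Hypothetical download with made-up column names and values.
example = io.StringIO(
    "accession\tbiome\tcentre_name\n"
    "STUDY_A\tSoil\tCentre X\n"
    "STUDY_B\tMarine\tCentre X\n"
    "STUDY_C\tSoil\tCentre Y\n"
)
soil = filter_rows(example, biome="Soil")
```

Passing several keyword arguments, for example `filter_rows(handle, biome="Soil", centre_name="Centre Y")`, narrows the result in the same way that selecting several facets does on the search page.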
The MGnify homepage ‘Search by’ and ‘Latest studies’ sections have several browsing options to easily navigate publicly available annotated data:
- Links to all studies, samples, analyses or experiment types redirect users to the search page, where more filtering criteria are available.
- There is also an option to browse by selected biomes. A subset of biome images with public samples is shown on the homepage. The ‘Browse all biomes’ link opens an expanded list. Upon selection, a table giving the hierarchical lineage according to the GOLD database classification is provided, with the number of projects associated with each lineage.
- Any link in the ‘Latest studies’ section redirects the user to the selected public project and all its available samples, runs and analyses.
The ‘Browse data’ tab allows users to search by super-studies, studies, samples or publications. Each search option has a text-based or biome filter. ‘Download results’ will return a CSV file of the search summary.
If you have given consent to the MGnify team to analyse your data for which you have requested a pre-publication confidential hold, you can access the analysis results of those pre-published data sets using your private login area. You can access this area by clicking on the ‘Login’ button, which you will find on the top right hand side of any page (see Figure 12 below).
After you have successfully logged into our system, you will have direct access to all your privately (and publicly) submitted projects and samples. You will find a list of your latest submissions (projects and samples) on the home page, and you also have access to all the projects you have submitted so far on the projects list view (Figure 13 below). On that page you will find a drop-down filter item ‘My projects’, which allows you to list all your projects.