GSC23 MGnify Workshop


Tatiana Gurbich

MGnify team at EMBL-EBI

Sandy Rogers

MGnify team at EMBL-EBI

Virginie Grosboillot

ETH Zurich


GSC23 MGnify Workshop Advanced Session - Practical Exercise

In this practical session you will:

  • become familiar with the MGnify API and learn to access it programmatically
  • explore the MGnify Genomes resource using Python

How this notebook works:

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.

This notebook is divided into 3 sections:

  • Section 1 focuses on MGnify Analyses (metagenomic datasets and assemblies analysed by MGnify)
  • Section 2 focuses on the MGnify Genomes resource (the genome catalogues)
  • Bonus section allows you to practice writing your own code to query the API

Import packages

Execute the code below to import the Python libraries required to run the examples presented in this notebook.

# Connection to MGnify API
from jsonapi_client import Session as APISession
from jsonapi_client import Modifier  # allows us to query specific values in given fields (e.g.: 'taxon-lineage').
import requests
from itertools import islice

# Dataframes and display
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# Plots
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline 

# Data download
from urllib.request import urlretrieve

# Warning verbosity
import warnings 

Section 1: explore MGnify analysis results programmatically

In the first portion of today’s workshop you learned about MGnify’s analysis of metagenomic data. In this exercise we will fetch and explore MGnify analysis results programmatically. These are just a few examples of the types of commands and operations that can be performed.

Task 1: Browse the recently analysed studies and save them to a file

Let’s start by loading the 50 most recently analysed studies and saving them to a file called studies.csv (it would take too long to load all studies for this example so we limit the number at 50):

endpoint = 'studies'

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources = map(lambda r: r.json, islice(mgnify.iterate(endpoint), 50))
    resources = pd.json_normalize(resources)
    resources.to_csv("studies.csv")

Click on the three dots below to open up a line by line explanation of this code block.

endpoint = 'studies'
# An "endpoint" is the specific resource within the API which we want to get data from. 
# It is the URL relative to the "server base URL" of the API, which for MGnify is 
# https://www.ebi.ac.uk/metagenomics/api/v1. 
# You can find the endpoints in the API Docs 

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # Calling "APISession" is enabling a connection to the MGnify API, that can be used multiple times. 
    # The `with ... as mgnify` syntax is a Python "context manager". 
    # Everything inside the `with...` block (i.e. indented lines below it) can use the APISession, 
    # which we've called mgnify here. 
    # When the `with` block closes (the indentation stops), the connection to the API is cleaned up.

    resources = map(lambda r: r.json, islice(mgnify.iterate(endpoint), 50))
    # `map` applies a function to every element of an iterable - so do something to each thing in a list.
    # The MGnify API is paginated, which means query results are broken up into pages for easier handling.
    # `mgnify.iterate(endpoint)` is a very helpful function that loops over as many pages of results as 
    # there are.
    # `lambda r: r.json` is grabbing the JSON attribute from each study returned from the API.
    # The islice() function from the itertools module is used to limit the number of results to 50. 
    # It ensures that only the first 50 items will be processed. We added this to avoid loading all of 
    # the studies, which would take a long time.
    # All together, this makes `resources` be 50 JSON representations we could loop through, each containing 
    # the data of a study.

    resources = pd.json_normalize(resources)
    # pd is the shorthand for the pandas package - you'll see it anywhere people are using pandas.
    # The json_normalize function takes "nested" data and turns it into a table.

    resources.to_csv("studies.csv")
    # Pandas has a built-in way of writing CSV (or TSV, etc) files, which is helpful for getting data into 
    # other tools. This writes the table-ified study list to a file called studies.csv.
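To see what `islice` and `pd.json_normalize` do without touching the API, here is a self-contained sketch using made-up study records (the accessions and field names below are invented for illustration):

```python
from itertools import islice

import pandas as pd

# A generator standing in for mgnify.iterate(endpoint): it could yield
# records forever, but islice stops consuming after the first 3.
def fake_studies():
    n = 0
    while True:
        n += 1
        yield {"id": f"MGYS{n:08d}", "attributes": {"samples-count": n * 10}}

records = list(islice(fake_studies(), 3))

# json_normalize flattens the nested "attributes" dict into dotted column
# names, just like the "attributes.*" columns you see in studies.csv.
df = pd.json_normalize(records)
print(df.columns.tolist())
print(len(df))  # 3
```

Because the generator is only consumed lazily, `islice` means only 3 records are ever produced, which is exactly why it keeps the API query above fast.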

Studies have been saved to file. Double-click on the file name studies.csv in the left-hand panel to open it. You can see the list of study accessions and information about each of them.

Task 2: Explore a selected study

Pick a study to load and explore in more detail. In the code below we picked study MGYS00001935 as the default, however, you can choose a different study accession from the MGnify website or from the studies.csv file you just generated. When you execute the code block below, a prompt window will appear. When that happens, paste the study accession you chose into the window and press ‘Enter’. To use the default, press ‘Enter’ without pasting anything into the window.

from lib.variable_utils import get_variable_from_link_or_input

accession = get_variable_from_link_or_input('MGYS', 'Study Accession', 'MGYS00001935')

# you could also assign the accession value directly:
# accession = 'MGYS00001935'

Fetch data for the selected study

Load analyses from this study into a Pandas dataframe. Pandas is a Python library that is widely used to work with and manipulate data; a dataframe is a table with rows and columns. Note that we will be using a different endpoint here because now we are fetching analyses rather than studies. You will also notice that endpoints can go several levels beyond the base URL; in this example it is base URL + studies + “specific study accession” + analyses.

endpoint = f"studies/{accession}/analyses"

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    analyses = map(lambda r: r.json, mgnify.iterate(endpoint))
    analyses = pd.json_normalize(analyses)

Inspect the data

Analyses from the study are now loaded into the dataframe analyses. The .head() method prints the first few rows of the table:

analyses.head()

Exploring the data - Example 1: distribution of sequencing instruments used for the analysed samples

Let’s make a plot showing what instrument type(s) were used to sequence the samples in the study we just loaded.

analyses.groupby('attributes.instrument-model').size().plot(kind='pie', autopct='%1.1f%%')
plt.title('Percentages of analysed samples by instrument type');

Click on the three dots below to open up an explanation of how this code works.

analyses.groupby('attributes.instrument-model'): the groupby operation here groups the data in the dataframe analyses based on the values in the column attributes.instrument-model. That means it groups the data according to the instrument models mentioned in that column.

The .size() function is applied after grouping to count the number of occurrences (size) of each group. It counts how many samples were analyzed using each instrument model.

.plot(kind='pie', autopct='%1.1f%%'): this line creates a pie chart from the grouped and counted data. The kind='pie' parameter specifies that a pie chart should be plotted. The autopct='%1.1f%%' parameter formats the percentage values in the pie chart to show one decimal place.

plt.title('Percentages of analysed samples by instrument type'): this line sets the title of the pie chart to “Percentages of analysed samples by instrument type.”
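The `groupby(...).size()` pattern is easy to try on a toy dataframe first; the instrument names below are invented, but the counting works exactly the same way as in the plot above:

```python
import pandas as pd

toy = pd.DataFrame({
    "attributes.instrument-model": [
        "Illumina MiSeq", "Illumina MiSeq", "Illumina HiSeq 2500",
    ]
})

# Group rows by instrument model, then count the rows in each group
counts = toy.groupby("attributes.instrument-model").size()
print(counts)  # MiSeq: 2, HiSeq 2500: 1
```

Calling `.plot(kind='pie')` on such a series simply turns each group's count into a pie slice.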

Exploring the data - Example 2: filtering results

You might have noticed while exploring the dataframe analyses above that the column attributes.analysis-summary contains JSONs as its values. Let’s run a Python code to add the information from these JSONs as new columns to the dataframe:

# First we define a function which we will call "analysis_summary_to_df" that will 
# take the attributes.analysis-summary column, convert it into a dataframe and then 
# return it as a Pandas pivot table, which is a table that summarises the dataframe
def analysis_summary_to_df(analysis):
    summary_df = pd.json_normalize(analysis['attributes.analysis-summary'])
    if summary_df.size:
        return summary_df.pivot_table(index=None, columns="key", values="value", aggfunc="first")
    return summary_df

# Add a column "summary_df" to the dataframe "analyses". The column is generated using
# the function "analysis_summary_to_df" that we defined above and will hold the data
# from the JSONs that we are converting here.
analyses['summary_df'] = analyses.apply(analysis_summary_to_df, axis=1)

# Now we will generate a new dataframe called analyses_summary_df
analyses_summary_df = None

# Iterate through the rows of the analyses dataframe
for idx, row in analyses.iterrows():
    # Set the index of the summary_df dataframe to the index of the line we are iterating through
    row.summary_df.index = [idx]
    # Add lines from summary_df to analyses_summary_df
    if analyses_summary_df is None:
        analyses_summary_df = row.summary_df
    else:
        analyses_summary_df = pd.concat([analyses_summary_df, row.summary_df])

# Concatenate the new analyses_summary_df dataframe (which has the data from JSON in column form)
# and our original analyses dataframe and save them to a new dataframe called transformed_analyses
transformed_analyses = pd.concat([analyses, analyses_summary_df], axis=1)

# Remove the attributes.analysis-summary and summary_df columns, we no longer need them
transformed_analyses.drop(['attributes.analysis-summary', 'summary_df'], axis=1, inplace=True)

# View the first few lines of the updated dataframe
transformed_analyses.head()
Scroll to the right to see the newly added fields.
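If the reshaping above feels opaque, this minimal sketch shows what `json_normalize` followed by `pivot_table` does to a single `attributes.analysis-summary` value (the keys and counts below are invented):

```python
import pandas as pd

# One analysis-summary value: a list of {"key": ..., "value": ...} pairs
summary = [
    {"key": "Predicted LSU sequences", "value": "142"},
    {"key": "Predicted SSU sequences", "value": "310"},
]

# json_normalize turns the list of pairs into a 2-column dataframe (key, value)
summary_df = pd.json_normalize(summary)

# pivot_table rotates it into a single row with one column per key,
# which is what lets the keys become new dataframe columns
wide = summary_df.pivot_table(index=None, columns="key", values="value", aggfunc="first")
print(wide)
```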

Let’s filter the results to only show analyses that have at least 1 predicted LSU (large ribosomal subunit) sequence:

# Create a new dataframe called filtered_analyses which will include all lines from the
# transformed_analyses dataframe except for the ones where the value in the "Predicted LSU sequences"
# column is "0" or no value (NaN)
filtered_analyses = transformed_analyses[
    ~(transformed_analyses['Predicted LSU sequences'].isin(['0']))
]

filtered_analyses = filtered_analyses.dropna(subset=["Predicted LSU sequences"])

# print the filtered dataframe
filtered_analyses
Q: How many analyses have at least 1 predicted LSU?

Hint: look at the dataframe dimensions that are printed below the dataframe. What is the number of rows? Note that some studies might not have any predicted LSU sequences and the dataframe above could be empty.
You could also find the number of rows in the dataframe by executing the code:

len(filtered_analyses)

You can similarly filter the results using any of the other columns depending on what you are interested in.
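As a sketch of that idea, here is the same dropna-then-filter pattern on an invented column, with one extra wrinkle: the summary values are strings, so cast them before comparing numerically:

```python
import pandas as pd

# Invented summary column with string counts and a missing value
toy = pd.DataFrame({"Predicted SSU sequences": ["0", "25", None, "7"]})

# Drop rows with no value, then keep rows whose count exceeds a threshold
filtered = toy.dropna(subset=["Predicted SSU sequences"])
filtered = filtered[filtered["Predicted SSU sequences"].astype(int) > 5]
print(len(filtered))  # 2 rows: "25" and "7"
```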

Exploring the data - Example 3: download the analysis results files

Now let’s fetch the actual results files. To do that we will use the /analyses/{analysisId}/downloads endpoint. For the purposes of this exercise we will download the results for only the first 2 analyses from the unfiltered dataframe analyses but in your own work you could download them all or download only the ones that passed the filters that you applied.

There are different result files available, let’s first check which ones we could download:

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as session:
    for analysisId in analyses.head(2)['attributes.accession']:
        print(f"\nFiles available for analysis {analysisId}:")
        for download in session.iterate(f"analyses/{analysisId}/downloads"):
            print(f"{download.alias}: {download.description.label}")

As an example, let’s download only the tables containing the taxonomic assignments for SSU rRNA.

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as session:    
    for analysisId in analyses.head(2)['attributes.accession']:
        print(f"Processing: {analysisId}")
        # Start another for loop to go over the files to download
        for download in session.iterate(f"analyses/{analysisId}/downloads"):
            OTUFiles = f"{analysisId}_MERGED_FASTQ_SSU_OTU.tsv"
            if (
                download.description.label == 'OTUs, counts and taxonomic assignments for SSU rRNA'
                and download.file_format.name == 'TSV'
            ):
                print(f"Downloading file for {analysisId}")
                urlretrieve(download.links.self.url, OTUFiles)
                break
        print("Finished for", analysisId)

Click on the three dots below to see an explanation for how this code works.

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as session:
    # Starting a connection to MGnify's API just like we did before
    for analysisId in analyses.head(2)['attributes.accession']:
        # Here we are starting a for loop to iterate over values in the 'attributes.accession' column
        # of the analyses dataframe (this column contains analysis accessions). However, we only want 
        # to iterate over the first two accessions which is why we use the .head(2) method to give us 
        # just the first 2 records.
        print(f"Processing: {analysisId}")
        # Print the analysis accession we are currently processing
        for download in session.iterate(f"analyses/{analysisId}/downloads"):
            # Start another for loop and use the .iterate() method to retrieve information about downloads 
            # associated with the current analysis ID. The string is formatted with the analysisId to 
            # form the specific API endpoint: analyses/{analysisId}/downloads where the value for analysisId
            # will be plugged in for each iteration.
            OTUFiles = f"{analysisId}_MERGED_FASTQ_SSU_OTU.tsv"
            # This line generates the file name of the OTU results file: analysis accession followed by the
            # string "_MERGED_FASTQ_SSU_OTU.tsv"
            if (
                download.description.label == 'OTUs, counts and taxonomic assignments for SSU rRNA'
                and download.file_format.name == 'TSV'
            ):
                # We only want to download the OTU files in this case and not any other result files. 
                # This piece of code is looking for those files by checking the format and file description.
                # Once the correct format and description are found, the next block of code is executed.
                print(f"Downloading file for {analysisId}")
                urlretrieve(download.links.self.url, OTUFiles)
                # Fetch the file from the link stored in download.links.self.url. When saving the file locally,
                # give it the name stored in the variable OTUFiles (f"{analysisId}_MERGED_FASTQ_SSU_OTU.tsv")
                break
                # After downloading the file, the code breaks out of the inner for loop since only one matching 
                # file is needed for each analysis ID. We don't need to check any more records.
        print("Finished for", analysisId)

You can see the downloaded tables in the left-hand panel. Double-click the file names to open and browse them.

Section 2: exploring the MGnify Genomes resource programmatically

In this section of the practical exercise we will look at a few ways Python can be used to explore the MGnify Genomes resource via the API.

Task 1: Query the genomes database from MGnify API

We will be using the genomes endpoint to work with MGnify Genomes:

endpoint = 'genomes'

Let’s fetch data from this endpoint and see what the first result looks like:

r = requests.get(f"https://www.ebi.ac.uk/metagenomics/api/v1/{endpoint}")
r.json()['data'][0]

Here we are looking at one genome, the metadata for this genome, results summary and links. You can see what types of information we can get for each genome from this API endpoint. Take a moment to look over the output.

Q: can you spot what catalogue this genome is from?
Q: what is the taxon lineage for this genome?
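The response follows the JSON:API convention: each record in `r.json()['data']` has `id`, `attributes`, and `relationships`. A minimal sketch with a made-up genome record (all values invented) shows how you could answer the questions above in code:

```python
# Invented, minimal stand-in for one record from r.json()["data"]
genome = {
    "id": "MGYG000000001",
    "attributes": {
        "taxon-lineage": "d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__",
        "gc-content": 42.5,
    },
    "relationships": {
        "catalogue": {"data": {"id": "human-gut-v2-0"}},
    },
}

# Attributes hold the metadata; relationships point at linked resources
lineage = genome["attributes"]["taxon-lineage"]
catalogue = genome["relationships"]["catalogue"]["data"]["id"]
print(catalogue)               # human-gut-v2-0
print(lineage.split(";")[-1])  # s__  (no species assigned in this made-up record)
```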

Task 2: Search for a specific taxon

If we are interested in a specific taxon, for example, a specific genus or species, we can find information about it from the API.
Let’s use the genus Prevotella and the species Prevotella melaninogenica as our examples.

The taxon-lineage field contains domain, phylum, class, order, family, genus, species in the following format:
d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__Prevotella melaninogenica

If we wanted to filter the database for the genus Prevotella we could use the full lineage: d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella
or only part of it:
g__Prevotella or Prevotella.
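To see why a partial match like g__Prevotella works, you can split the lineage string on ';' yourself:

```python
lineage = ("d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
           "f__Bacteroidaceae;g__Prevotella;s__Prevotella melaninogenica")

# The seven ranks appear in a fixed order:
# domain, phylum, class, order, family, genus, species
ranks = lineage.split(";")
genus = ranks[5]
species = ranks[6]
print(genus)    # g__Prevotella
print(species)  # s__Prevotella melaninogenica
```

Any substring of the lineage, prefixed or bare, identifies the taxon, which is why all three filter forms above return the same genus.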

Set the desired filters:

genus_filter = 'Prevotella'
species_filter = 'Prevotella melaninogenica'

Query the database with the species filter and save the results to a Pandas DataFrame:

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter_sp = Modifier(f"taxon_lineage={species_filter}")
    resources_sp = map(lambda r: r.json, mgnify.iterate(endpoint, filter=search_filter_sp))
    resources_df_sp = pd.json_normalize(resources_sp)
# Here we are using the Modifier function from the jsonapi_client module. It allows us 
# to query for specific values in given fields (e.g.: 'taxon-lineage', 'geographic-origin').
# In this case we are adding a filter that limits results to those that have the species
# name that we have saved in the species_filter variable (Prevotella melaninogenica) in the 
# taxon_lineage field.

# Display the table containing the results of the query
resources_df_sp

Q: how many times does the species occur in the catalogues? In which biomes?

Query the database with the genus filter and store the results in a Pandas DataFrame:

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    search_filter = Modifier(f"taxon_lineage={genus_filter}")
    resources = map(lambda r: r.json, mgnify.iterate(endpoint, filter=search_filter))
    resources_df = pd.json_normalize(resources)

# Display the table containing the results of the query
resources_df

Q: how many times does the genus occur in MGnify Genomes?

Let’s get more information on the occurrence of this genus in MGnify Genomes.

Which catalogues does the genus occur in?

We can use the catalogue information to generate a pie chart that shows the catalogues where the genus can be found (in the dataframe produced by json_normalize, the catalogue ID appears in the relationships.catalogue.data.id column):

resources_df.groupby('relationships.catalogue.data.id').size().plot(kind='pie', autopct='%1.0f%%')
plt.title('Catalogues where the genus Prevotella is found');
What are the most common species from the Prevotella genus across the catalogues?

First, let’s see the total number of distinct lineages within the Prevotella genus across the catalogues:

resources_df['attributes.taxon-lineage'].nunique()

Now let’s find the 10 most common species:

resources_df['attributes.taxon-lineage'].value_counts().head(10)

The top record occurs in the catalogues 87 times but there is no species name (the name is s__ indicating that it could not be assigned by GTDB-Tk, the tool that we use for taxonomic analysis). This indicates that the species is not yet known in GTDB.

Note: the 87 records represent multiple unknown (novel) species. In other words, across the catalogues we have 87 species representative genomes that belong to the genus Prevotella but we don’t know what species they are.

The next most common species occurs in the catalogues 3 times, and there are at least 9 species like that, as we can see from the table (there are likely more, but we only printed the first 10 lines). Since our catalogues are dereplicated at the species level, each species can only be found in a biome-specific catalogue once. In total we currently have 9 catalogues, so we wouldn’t expect the maximum number of occurrences for any species to exceed 9.
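The counting logic behind that table can be reproduced with `value_counts()` on a toy set of lineages (invented here): genomes with no species assignment share the same trailing s__, so they collapse into one heavily counted row:

```python
import pandas as pd

prefix = ("d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;"
          "f__Bacteroidaceae;g__Prevotella;")

# Two species representatives with no species name, one named species
lineages = pd.Series([
    prefix + "s__",
    prefix + "s__",
    prefix + "s__Prevotella copri",
])

counts = lineages.value_counts()
print(counts.iloc[0])                   # 2: identical s__ lineages are counted together
print(counts.index[0].endswith("s__"))  # True: the unassigned lineage tops the list
```

This is why a single s__ row can represent many distinct novel species, as the note above explains.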

Which catalogues contain the novel Prevotella species?

# Save the rows containing novel Prevotella species into a new dataframe
novel_prevotella_df = resources_df[resources_df['attributes.taxon-lineage'] == 'd__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella;s__']

# Plot the catalogues
novel_prevotella_df.groupby('relationships.catalogue.data.id').size().plot(kind='pie', autopct='%1.0f%%')
plt.title('Catalogues where the novel Prevotella species are found');
Q: Which catalogue has a higher number of novel Prevotella species?
Q: Which catalogue only contains known Prevotella species?

Hint: compare this pie chart to the previous one - which catalogue disappeared?

Task 3: Create a graph showing most common phyla in MGnify Genomes.

For this exercise we will be working with pre-fetched data for all genomes, because fetching it live takes a while given the number of genomes. If you wanted to fetch the data yourself, the code for that is below (expand by clicking on the three dots, but please don’t uncomment and execute it during the workshop).

# endpoint = "genomes"
# with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
#     resources_all = map(lambda r: r.json, mgnify.iterate(endpoint))
#     resources_all_df = pd.json_normalize(resources_all)
# resources_all_df.to_parquet('all_genome_resources_Aug2023.parquet')

The table with all genomes has been saved as a Parquet file. This file format allows for efficient data storage and retrieval. The file is located in ../example-data/genomes/ directory and is called all_genome_resources_Aug2023.parquet.

Let’s load it into a dataframe all_genomes_df:

all_genomes_df = pd.read_parquet('../example-data/genomes/all_genome_resources_Aug2023.parquet')

# Let's check what this dataframe looks like
all_genomes_df.head()

# Using Pandas we could also do some quick analysis, for example, what is the average GC content of the 
# genomes in MGnify? We can use the .describe() method that calculates statistical information.
all_genomes_df['attributes.gc-content'].describe()

This is telling us that in total there are 12,530 GC content records (matching the total number of species representative genomes) and the mean GC% is 47.65%.

Now let’s plot the phyla. We need to do some prep work first:

# Split taxon lineage into columns (separate column for each taxonomic level)

# Write each possible taxonomic level into a list called "features"
features = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

def get_lineage_column(lineage_str, i):
    # This is a Python function that takes in a full taxonomic lineage, for example,
    # d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus delbrueckii
    # and returns the value in the requested position. In Python we count from 0 so 
    # if we requested the taxon name in position 1 (i = 1), the function would 
    # return "p__Bacillota"

    lineage_split = lineage_str.split(';')
    return lineage_split[i] 

# Apply the function to create the new columns in the pandas DataFrame by splitting column attributes.taxon-lineage
for i, feature in enumerate(features):
    all_genomes_df[feature] = all_genomes_df['attributes.taxon-lineage'].apply(lambda x: get_lineage_column(x, i))
# Let's see what the new columns look like (scroll to the right)
all_genomes_df.head()
# Use the value_counts() to get the count of each phylum in the corresponding column.
# Sort values from largest to smallest by using method .sort_values(ascending=False) and keep just the first 20
# records (20 most common phyla). Store them in the phylum_counts series (one-dimensional array).

phylum_counts = all_genomes_df['phylum'].value_counts().sort_values(ascending=False).head(20)

We can add up all counts from phylum_counts to see what proportion of all genomes are from the top 20 phyla:

phylum_counts.sum()

This tells us that out of a total of 12,530 genomes, 12,215 are in the 20 most abundant phyla.
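A quick sanity check of what fraction of the genomes that represents, using the two totals quoted above:

```python
total_genomes = 12530   # all species representative genomes
top20_genomes = 12215   # genomes in the 20 most common phyla

fraction = top20_genomes / total_genomes
print(f"{fraction:.1%}")  # 97.5%
```

So the top 20 phyla account for roughly 97.5% of the catalogue genomes, which is why a 20-bar plot is a reasonable summary.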

# Create the bar plot
plt.figure(figsize=(10, 6))  # Optional: Set the figure size
phylum_counts.plot(kind='bar')

# Optional: Customize the plot
plt.title('Occurrences of each Phylum')
plt.xlabel('Phylum')
plt.ylabel('Number of genomes')

# Show the plot
plt.show()
Task 4: produce a quality control figure similar to Extended Data Fig. 4a of Almeida et al 2020

Use the .describe() method to get statistical information for the completeness and contamination columns in the all_genomes_df dataframe:

all_genomes_df[['attributes.completeness', 'attributes.contamination']].describe()

and make a plot:

fig = plt.figure(figsize=(5, 10), layout="constrained")  # set the size and layout
spec = fig.add_gridspec(1, 1)  # set the grid

ax00 = fig.add_subplot(spec[0, 0])  # create the placeholder for the boxplot

# Create the boxplot using Seaborn's boxplot function. It visualizes the distribution 
# of two columns from the DataFrame 'all_genomes_df': 'attributes.completeness' and 'attributes.contamination'. 
# The boxplot shows the median, quartiles, and any outliers in the data for each column.
sns.boxplot(data=all_genomes_df[['attributes.completeness', 'attributes.contamination']])

plt.ylabel("%")  # set the y-axis label to show percentages

fig.suptitle('Quality of genomes in MGnify Genomes');
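The box in a boxplot is just the quartiles that `.describe()` reports. A toy series of invented completeness values makes the correspondence explicit:

```python
import pandas as pd

# Invented completeness percentages for five genomes
completeness = pd.Series([88.0, 90.5, 92.0, 95.5, 99.0])

stats = completeness.describe()
print(stats["50%"])                # 92.0 - the median, the line inside the box
print(stats["25%"], stats["75%"])  # 90.5 95.5 - the box edges
print(stats["min"], stats["max"])  # 88.0 99.0 - the whisker extremes (no outliers here)
```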

Bonus section: write your own code.

The goal of this bonus section is to give you an opportunity to practice writing code to access the MGnify API yourself. We will provide a template and there is a solution at the bottom if you need help.

Task 1: query the API to find out how many genomes there are in each catalogue.

First you need to find the API endpoint to use. Open this link to look through the list of endpoints. You can click on them to see what information is inside. Once you have picked the endpoint, fill in the blank below:

endpoint = "_________"  # fill in the blank

Write a codeblock below that will start an API session, iterate over the pages using the chosen endpoint and get JSONs for all catalogues. Convert the JSONs into a dataframe called resources_df. Then print just the columns that contain the catalogue name and the number of species in each (the species counts are saved in a column called attributes.genome-count; there are several columns that contain the catalogue name, you can pick any one).

Hint: the code for querying the API is the same we already used to fetch analyses for a selected study earlier in the practical (but the endpoint is different)

with ______("_____________________") as mgnify:
    resources = __________________
    resources_df = __________________________
resources_df[['_____________', 'attributes.genome-count']]

Click on the three dots below to open the solution

# Solution
endpoint = "genome-catalogues"

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    resources = map(lambda r: r.json, mgnify.iterate(endpoint))
    resources_df = pd.json_normalize(resources)

# print the columns that contain the catalogue name and the number of genomes in each catalogue
resources_df[['attributes.name', 'attributes.genome-count']]

Task 2: generate a pie chart that would show relative sizes of the catalogues. Add catalogue names as labels on the chart.

# Fill in the blanks in this code to generate the pie chart

# Extract the 'attributes.name' and 'attributes.genome-count' columns from the resources_df dataframe
catalogue_names = _________________________
genome_counts = __________________________

# Create the pie chart
plt.figure(figsize=(8, 8))  # Set the figure size
plt.pie(___________, labels=__________, autopct='%1.1f%%', startangle=140)

# Set the title
plt.title('Relative Sizes of Catalogues')

# Show the pie chart
plt.show()

Click on the three dots below to see the solution.

# Solution
# Extract the 'attributes.name' and 'attributes.genome-count' columns
catalogue_names = resources_df['attributes.name']
genome_counts = resources_df['attributes.genome-count']

# Create the pie chart
plt.figure(figsize=(8, 8))  # Set the figure size
plt.pie(genome_counts, labels=catalogue_names, autopct='%1.1f%%', startangle=140)

# Set the title
plt.title('Relative Sizes of Catalogues')

# Show the pie chart
plt.show()

Task 3: download a genome from the API

We have downloaded analysis results earlier in this practical (at the end of Section 1). You can download genomes and annotations in a similar way. For this exercise, let’s download all predicted CDS for genome MGYG000321626.

Start by identifying the endpoint you would need for this here.

Hint: review what endpoints looked like when we downloaded analyses - they are not always just one level beyond the base URL.

endpoint = "______________"

Fill in the blanks in the code block below to download the file. To see the file format and description label that we are looking for, use the link to API, navigate to the appropriate endpoint and see what the results look like in the browser.

endpoint = "___________________" 

with APISession("__________________") as session:    
    for download in session.iterate(endpoint):
        # Go over the different files to download until we find the one with CDS results
        cds_file = "MGYG000321626.faa"
        if (
            download.description.label == '_____________'
            and download.file_format.name == '_____________'
        ):
            print(f"Downloading the file")
            urlretrieve(download.links.self.url, cds_file)
print("Finished downloading", cds_file)

Click on the three dots below to see the solution.

endpoint = "genomes/MGYG000321626/downloads"  # we are only interested in a specific genome

with APISession("https://www.ebi.ac.uk/metagenomics/api/v1") as session:    
    for download in session.iterate(endpoint):
        # Go over the different files to download until we find the one with CDS results
        cds_file = "MGYG000321626.faa"
        if (
            download.description.label == 'Predicted CDS (aa)'
            and download.file_format.name == 'FASTA'
        ):
            print(f"Downloading the file")
            urlretrieve(download.links.self.url, cds_file)
print("Finished downloading", cds_file)