Interactive map for AtlantECO project


Kate S [Ekaterina Sakharova] (MGnify team)

Mapping samples from the AtlantECO Super Study

… using the MGnify API and an interactive map widget

The MGnify API returns JSON data. The jsonapi_client package can help you load this data into Python, e.g. into a Pandas dataframe.
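As a rough illustration of what "loading JSON into a dataframe" means, the sketch below flattens two hand-written records shaped like JSON:API resources. The record values (`STUDY_A`, `STUDY_B`) are made up for illustration and do not come from MGnify; only `pd.json_normalize` is a real pandas call.

```python
import pandas as pd

# Hand-written records shaped like JSON:API resources (illustrative values only).
records = [
    {
        "type": "studies",
        "id": "STUDY_A",
        "attributes": {"accession": "STUDY_A", "samples-count": 3},
    },
    {
        "type": "studies",
        "id": "STUDY_B",
        "attributes": {"accession": "STUDY_B", "samples-count": 5},
    },
]

# Nested keys are flattened into dot-separated column names,
# e.g. attributes -> samples-count becomes "attributes.samples-count".
flat = pd.json_normalize(records)
print(sorted(flat.columns))
```

This is why the tables in this notebook have column names such as `attributes.samples-count`.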

This example shows how to load a MGnify Super Study’s data from the MGnify API and display it on an interactive world map.

You can find all of the other “API endpoints” using the Browsable API interface in your web browser. The URL you see in the browsable API is exactly the same as the one you can use in this code.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.

Fetch all AtlantECO studies

A Super Study is a collection of MGnify Studies originating from a major project. AtlantECO is one such project, aiming to develop and apply a novel, unifying framework that provides knowledge-based resources for a better understanding and management of the Atlantic Ocean and its ecosystem services.

Fetch the Super Study’s Studies from the MGnify API, into a Pandas dataframe:

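import pandas as pd
from jsonapi_client import Session, Modifier

atlanteco_endpoint = 'super-studies/atlanteco/flagship-studies'
# Root URL of the MGnify API
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    # iterate() walks through every page of results; r.json is one resource as a plain dict
    studies = map(lambda r: r.json, mgnify.iterate(atlanteco_endpoint))
    studies = pd.json_normalize(studies)
studies[:5]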
type id attributes.samples-count attributes.accession attributes.bioproject attributes.secondary-accession attributes.centre-name attributes.last-update
0 studies MGYS00006075 159 MGYS00006075 PRJEB50073 False ERP134625 EMG The Third Party Annotation (TPA) assembly was ... EMG produced TPA metagenomics assembly of PRJE... SUBMITTED 2023-09-27T18:43:33 [{'id': 'root:Environmental:Aquatic:Marine', '...
1 studies MGYS00005780 480 MGYS00005780 PRJEB45427 False ERP129541 EMG The Third Party Annotation (TPA) assembly was ... EMG produced TPA metagenomics assembly of PRJN... SUBMITTED 2023-09-27T15:08:51 [{'id': 'root:Environmental:Aquatic:Marine', '...
2 studies MGYS00006074 75 MGYS00006074 PRJEB51168 False ERP135767 EMG The Third Party Annotation (TPA) assembly was ... EMG produced TPA metagenomics assembly of PRJE... SUBMITTED 2022-10-21T19:56:09 [{'id': 'root:Environmental:Aquatic:Marine', '...
3 studies MGYS00006072 130 MGYS00006072 PRJEB54918 False ERP139784 EMG The Third Party Annotation (TPA) assembly was ... EMG produced TPA metagenomics assembly of PRJN... SUBMITTED 2022-10-13T14:03:52 [{'id': 'root:Environmental:Aquatic:Marine:Oce...
4 studies MGYS00006034 971 MGYS00006034 PRJEB50181 False ERP134737 EMG The Third Party Annotation (TPA) assembly was ... EMG produced TPA metagenomics assembly of PRJN... SUBMITTED 2022-10-10T12:22:12 [{'id': 'root:Environmental:Aquatic:Marine', '...

Show the studies’ samples on a map

We can fetch the Samples for each Study and concatenate them into one DataFrame. Each sample has geolocation data (longitude and latitude) in its attributes, which is what we need to build a map.

It takes time to fetch data for all samples, so let’s show samples from the first 6 studies only.

studies_samples = []

with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, study in studies[:6].iterrows():
        print(f"fetching {study['id']} samples")
        samples = map(lambda r: r.json, mgnify.iterate(f"studies/{study['id']}/samples?page_size=1000"))
        samples = pd.json_normalize(samples)
        # Keep only the fields needed for the map, plus the parent study accession
        samples = pd.DataFrame(data={
            'accession': samples['id'],
            'sample_id': samples['id'],
            'study': study['id'],
            'lon': samples['attributes.longitude'],
            'lat': samples['attributes.latitude'],
            'color': "#FF0000",
        })
        samples.set_index('accession', inplace=True)
        studies_samples.append(samples)
studies_samples = pd.concat(studies_samples)
fetching MGYS00006075 samples
fetching MGYS00005780 samples
fetching MGYS00006074 samples
fetching MGYS00006072 samples
fetching MGYS00006034 samples
fetching MGYS00006061 samples
print(f"fetched {len(studies_samples)} samples")

fetched 1818 samples
           sample_id         study        lon       lat    color
accession
ERS491130  ERS491130  MGYS00006075   -53.0063  -64.3595  #FF0000
ERS491201  ERS491201  MGYS00006075   -56.0916  -63.8768  #FF0000
ERS493286  ERS493286  MGYS00006075  -158.0067   22.7546  #FF0000
ERS492680  ERS492680  MGYS00006075  -139.2393   -8.9729  #FF0000
ERS488769  ERS488769  MGYS00006075    63.5851   20.8457  #FF0000
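Before plotting, it can help to confirm that every row has usable coordinates. The sketch below is not part of the original notebook: it uses a made-up table (`toy`, with invented sample IDs) and assumes, without evidence from the output above, that coordinates might arrive as strings or be missing.

```python
import pandas as pd

# Toy stand-in for the samples table; values are made up for illustration.
toy = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3", "S4"],
        "lon": ["-53.0063", None, "200.0", "63.5851"],
        "lat": ["-64.3595", "10.0", "5.0", "20.8457"],
    }
)

# Coerce to numbers; anything missing or unparseable becomes NaN.
toy["lon"] = pd.to_numeric(toy["lon"], errors="coerce")
toy["lat"] = pd.to_numeric(toy["lat"], errors="coerce")

# Keep only rows whose coordinates fall inside the valid geographic range.
# NaN never satisfies between(), so rows with missing values are dropped too.
valid = toy["lon"].between(-180, 180) & toy["lat"].between(-90, 90)
mappable = toy[valid]
print(mappable["sample_id"].tolist())
```

The same two steps could be applied to the `lon` and `lat` columns of the real samples table if some points fail to appear on the map.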
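import leafmap
m = leafmap.Map(center=(0, 0), zoom=2)
# The plotting call is missing from this preview; add_points_from_xy and the x, y and color_column values are assumed
m.add_points_from_xy(studies_samples, x='lon', y='lat',
    popup=["study", "sample_id"], color_column='color', add_legend=False)
m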

Check GO term presence

Let’s check whether a specific identifier is present in each sample. This example is written for GO-term ‘GO:0015878’, but other identifier types are available on the MGnify API.

We will work with the MGnify analyses (MGYAs) that correspond to the chosen samples, filtered by pipeline version (5.0) and experiment type (assembly).
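The filters above end up as one query string, with one extra parameter selecting the sample. A minimal sketch using only the standard library follows; `build_analysis_filter` is a hypothetical helper named here for illustration, and the parameter names are the ones this notebook passes to the analyses endpoint.

```python
from urllib.parse import urlencode


def build_analysis_filter(sample_accession: str) -> str:
    """Return a query string combining one sample with the two fixed filters."""
    params = {
        "pipeline_version": "5.0",
        "sample_accession": sample_accession,
        "experiment_type": "assembly",
    }
    # urlencode joins key=value pairs with "&" and escapes unsafe characters.
    return urlencode(params)


query = build_analysis_filter("ERS491130")
print(query)
```

Building the string from a dictionary avoids hand-typed `&` separators and handles escaping if a value ever contains special characters.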

This example processes just the first 10 samples (again, because the full dataset takes a while to fetch). First, get the analyses for each sample.

analyses = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, sample in studies_samples[:10].iterrows():
        print(f"processing {sample.sample_id}")
        filtering = Modifier(f"pipeline_version=5.0&sample_accession={sample.sample_id}&experiment_type=assembly")
        analysis = map(lambda r: r.json, mgnify.iterate('analyses', filter=filtering))
        analysis = pd.json_normalize(analysis)
        # Collect the per-sample tables so they can be concatenated afterwards
        analyses.append(analysis)
analyses = pd.concat(analyses)
analyses[:5]
processing ERS491130
processing ERS491201
processing ERS493286
processing ERS492680
processing ERS488769
processing ERS489043
processing ERS493213
processing ERS1092158
processing ERS494696
processing ERS489585
type id attributes.pipeline-version attributes.experiment-type attributes.analysis-summary attributes.analysis-status attributes.accession attributes.complete-time attributes.instrument-platform attributes.instrument-model
0 analysis-jobs MGYA00616211 5.0 assembly [{'key': 'Submitted nucleotide sequences', 'va... completed MGYA00616211 False 2022-11-08T09:14:44 ILLUMINA Illumina HiSeq 2000 ERZ4843237 assemblies MGYS00006075 studies ERS491130 samples
1 analysis-jobs MGYA00687433 5.0 assembly [{'key': 'Submitted nucleotide sequences', 'va... completed MGYA00687433 False 2023-09-27T17:22:17 ILLUMINA Illumina HiSeq 2000 ERZ4842514 assemblies MGYS00006075 studies ERS491130 samples
2 analysis-jobs MGYA00687456 5.0 assembly [{'key': 'Submitted nucleotide sequences', 'va... completed MGYA00687456 False 2023-09-27T18:43:34 ILLUMINA Illumina HiSeq 2000 ERZ4842561 assemblies MGYS00006075 studies ERS491130 samples
0 analysis-jobs MGYA00615742 5.0 assembly [{'key': 'Submitted nucleotide sequences', 'va... completed MGYA00615742 False 2022-10-25T20:03:49 ILLUMINA Illumina HiSeq 2000 ERZ4842505 assemblies MGYS00006075 studies ERS491201 samples
1 analysis-jobs MGYA00616249 5.0 assembly [{'key': 'Submitted nucleotide sequences', 'va... completed MGYA00616249 False 2022-11-08T11:45:45 ILLUMINA Illumina HiSeq 2000 ERZ4842566 assemblies MGYS00006075 studies ERS491201 samples

Next, check each analysis for presence or absence of the GO term. We add a column to the DataFrame holding a colour: blue if the GO term was found, red if not.

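identifier = "go-terms"
go_term = 'GO:0015878'
go_data = []
with Session("https://www.ebi.ac.uk/metagenomics/api/v1") as mgnify:
    for idx, mgya in analyses.iterrows():
        print(f"processing {mgya['id']}")
        analysis_identifier = map(lambda r: r.json, mgnify.iterate(f"analyses/{mgya['id']}/{identifier}"))
        analysis_identifier = pd.json_normalize(analysis_identifier)
        # Assumed: each annotation's 'id' holds its GO accession (the expression is truncated in this preview)
        found_ids = list(analysis_identifier['id']) if 'id' in analysis_identifier else []
        go_data.append("#0000FF" if go_term in found_ids else "#FF0000")
analyses.insert(2, identifier, go_data, allow_duplicates=True)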
processing MGYA00616211
processing MGYA00687433
processing MGYA00687456
processing MGYA00615742
processing MGYA00616249
processing MGYA00687455
processing MGYA00615715
processing MGYA00616170
processing MGYA00616183
processing MGYA00616202
processing MGYA00616248
processing MGYA00616316
processing MGYA00687454
processing MGYA00590513
processing MGYA00590515
processing MGYA00590549
processing MGYA00593137
processing MGYA00615713
processing MGYA00687453
processing MGYA00590518
processing MGYA00590565
processing MGYA00687448
processing MGYA00687452
processing MGYA00590511
processing MGYA00593122
processing MGYA00649140
processing MGYA00687451
processing MGYA00615708
processing MGYA00615769
processing MGYA00616176
processing MGYA00616201
processing MGYA00616224
processing MGYA00687442
processing MGYA00687449
processing MGYA00649209
processing MGYA00687447
processing MGYA00615757
processing MGYA00615763
processing MGYA00615801
processing MGYA00615813
processing MGYA00615839
processing MGYA00616214
processing MGYA00616221
processing MGYA00687421
processing MGYA00687446
processing MGYA00593111
processing MGYA00593121
processing MGYA00616324
processing MGYA00687445
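The colour assignment in the loop above reduces to a membership test. The sketch below isolates that step with made-up annotation lists; `presence_colour` and the two constants are hypothetical names introduced for illustration.

```python
FOUND_COLOUR = "#0000FF"    # blue: the term is annotated on the analysis
MISSING_COLOUR = "#FF0000"  # red: the term is absent


def presence_colour(term: str, annotation_ids) -> str:
    """Return the marker colour for one analysis, given the IDs annotated on it."""
    return FOUND_COLOUR if term in set(annotation_ids) else MISSING_COLOUR


# Made-up annotation lists for two hypothetical analyses.
colours = [
    presence_colour("GO:0015878", ["GO:0003677", "GO:0015878"]),
    presence_colour("GO:0015878", []),
]
print(colours)
```

An empty annotation list simply yields red, so analyses with no annotations of the chosen type need no special handling.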

Join the analyses and sample tables to have geolocation data and identifier presence data together.

We’ll create a new sub-DataFrame with a subset of the fields and add them to the map.

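# The 'relationships.<name>.data.id' column names and the add_points_from_xy call are assumed:
# they are blank in this preview and are inferred here from the JSON:API layout of the analyses table.
df = analyses.join(studies_samples.set_index('sample_id'), on='relationships.sample.data.id')
df2 = df[[identifier, 'lon', 'lat', 'study', 'attributes.accession',
          'relationships.study.data.id', 'relationships.sample.data.id',
          'relationships.assembly.data.id']].copy()
df2 = df2.set_index("study")
df2 = df2.rename(columns={"attributes.accession": "analysis_ID",
                          'relationships.study.data.id': "study_ID",
                          'relationships.sample.data.id': "sample_ID",
                          'relationships.assembly.data.id': "assembly_ID"})
m = leafmap.Map(center=(0, 0), zoom=2)
m.add_points_from_xy(df2, x='lon', y='lat',
                     popup=["study_ID", "sample_ID", "assembly_ID", "analysis_ID"],
                     color_column=identifier, add_legend=False)
m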
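To see what the join step does in isolation, here is a toy sketch with invented tables. The names `toy_analyses`, `toy_samples` and `sample_key` are illustrative only; `DataFrame.join` with `on=` matches a column of the left table against the index of the right table.

```python
import pandas as pd

# Toy analyses table: each row points at a sample via a key column.
toy_analyses = pd.DataFrame(
    {"analysis_ID": ["A1", "A2", "A3"], "sample_key": ["S1", "S1", "S2"]}
)

# Toy samples table, indexed by the sample identifier.
toy_samples = pd.DataFrame(
    {"sample_id": ["S1", "S2"], "lon": [10.0, -20.0], "lat": [5.0, 40.0]}
).set_index("sample_id")

# Left join: every analysis row gains the coordinates of its sample.
joined = toy_analyses.join(toy_samples, on="sample_key")
print(joined)
```

Because several analyses can belong to one sample, the joined table repeats that sample's coordinates once per analysis, so the map can show several markers at the same location.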