Search for Samples or Studies

R
Authors and affiliations

Sandy Rogers, MGnify team at EMBL-EBI

Ben Allen, Newcastle University

This is a static preview

You can run and edit these examples interactively on Galaxy

Search for MGnify Studies or Samples, using MGnifyR

The MGnify API returns data and relationships as JSON. MGnifyR is a package to help you read MGnify data into your R analyses.

This example shows you how to search MGnify Studies or Samples.

You can find all of the other “API endpoints” using the Browsable API interface in your web browser. This interface also lets you inspect the kinds of Filters that can be created for each list.
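To make the idea of endpoints and filters concrete, the sketch below assembles the kind of URL that a filtered list request corresponds to. It assumes that the public API root is https://www.ebi.ac.uk/metagenomics/api/v1 and that filters are passed as query-string parameters. `build_mgnify_url` is a hypothetical helper written for this illustration; it is not part of MGnifyR.

```r
# Hypothetical helper (not part of MGnifyR): build a list-endpoint URL with filters.
# Assumption: the API root below, and filters given as query-string parameters.
build_mgnify_url <- function(endpoint, ...,
                             base = "https://www.ebi.ac.uk/metagenomics/api/v1") {
  filters <- list(...)
  url <- paste0(base, "/", endpoint)
  if (length(filters) == 0) {
    return(url)
  }
  # URL-encode each value so that characters such as ":" or spaces are safe.
  encoded <- vapply(
    filters,
    function(v) utils::URLencode(as.character(v), reserved = TRUE),
    character(1)
  )
  paste0(url, "?", paste0(names(filters), "=", encoded, collapse = "&"))
}

polar_url <- build_mgnify_url("samples", latitude_gte = 88)
print(polar_url)
```

Pasting a URL like this into a browser should open the same list that the Browsable API shows after applying the filter by hand.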

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.


library(IRdisplay)
display_markdown(file = '../_resources/mgnifyr_help.md')

Help with MGnifyR

MGnifyR is an R package that provides a convenient way for R users to access data from the MGnify API.

Detailed help for each function is available in R using the standard ?function_name command (e.g. typing ?mgnify_query will bring up built-in help for the mgnify_query command).

A vignette is available containing a reasonably verbose overview of the main functionality. This can be read either within R with the vignette("MGnifyR") command, or in the development repository.

MGnifyR Command cheat sheet

The following list of key functions should give a starting point for finding relevant documentation.

  • mgnify_client() : Create the client object required for all other functions.
  • mgnify_query() : Search the whole MGnify database.
  • mgnify_analyses_from_xxx() : Convert xxx accessions to analyses accessions. xxx is either samples or studies.
  • mgnify_get_analyses_metadata() : Retrieve all study, sample and analysis metadata for given analyses.
  • mgnify_get_analyses_phyloseq() : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
  • mgnify_get_analyses_results() : Get functional annotation results for a set of analyses.
  • mgnify_download() : Download raw results files from MGnify.
  • mgnify_retrieve_json() : Low level API access helper function.
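As a rough illustration of how these functions chain together, the sketch below wraps three of them in one function. It assumes that `mgnify_analyses_from_studies()` and `mgnify_get_analyses_metadata()` take the client as their first argument and the accessions as their second, mirroring how `mgnify_query()` is called in this notebook. `fetch_study_metadata` is a hypothetical wrapper, and the function is only defined here, not called, because running it needs the MGnifyR package and network access.

```r
# Hypothetical wrapper (not part of MGnifyR). Sketch only: calling it requires
# the MGnifyR package and network access to the MGnify API.
fetch_study_metadata <- function(study_accession,
                                 cache_dir = "/tmp/mgnify_cache") {
  # Create the client object that all other functions need.
  mg <- MGnifyR::mgnify_client(usecache = TRUE, cache_dir = cache_dir)
  # Convert the study accession into its analysis accessions.
  # Assumption: client first, accession second.
  analyses <- MGnifyR::mgnify_analyses_from_studies(mg, study_accession)
  # Retrieve study, sample and analysis metadata for those analyses.
  MGnifyR::mgnify_get_analyses_metadata(mg, analyses)
}

# Example call, left commented out because it contacts the MGnify API:
# metadata <- fetch_study_metadata("MGYS00005985")
```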

Load packages:

library(vegan)
library(ggplot2)
library(phyloseq)
library(MGnifyR)

mg <- mgnify_client(usecache = TRUE, cache_dir = '/tmp/mgnify_cache')
Loading required package: permute

Loading required package: lattice

This is vegan 2.6-4

Contents

Documentation for mgnify_query

?mgnify_query

Example: find Polar samples

In these examples we set maxhits=1 to retrieve only the first page of results. You can change the limit or set it to -1 to retrieve all samples matching the query.

samps_np <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)
samps_sp <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)
samps_polar <- rbind(samps_np, samps_sp)
head(samps_polar)
A data.frame: 6 × 48
| | longitude | biosample | latitude | accession | analysis-completed | collection-date | geo-loc-name | sample-desc | environment-biome | environment-feature | depth | acc_type | biome | type | temperature | environmental package | salinity | investigation type | elevation | target gene |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| ERS1568988 | -89.2525 | SAMEA98155168 | 89.9903 | ERS1568988 | 2017-03-13 | 2015-09-07 | Arctic Ocean | Arctic seawater metagenome, 5 m | marine biome | marine water body | 5.0 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | NA | NA | NA | NA | NA | NA |
| ERS1972379 | -176.7614 | SAMEA104347401 | 88.4072 | ERS1972379 | 2017-11-20 | NA | NA | Station 31 sea ice, depth 0.1m | marine biome | marine water body | 0.1 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | -999.0 | water | 999.0 | NA | NA | NA |
| ERS1972376 | -176.7614 | SAMEA104347398 | 88.4072 | ERS1972376 | 2017-11-20 | NA | NA | Station 31 sea ice, depth 0.5m | marine biome | marine water body | 0.5 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | -999.0 | water | 999.0 | NA | NA | NA |
| ERS1568987 | -89.2525 | SAMEA98154418 | 89.9903 | ERS1568987 | 2017-03-13 | 2015-09-07 | Arctic Ocean | Arctic seawater metagenome, 1.5 m | marine biome | marine water body | 1.5 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | NA | NA | NA | NA | NA | NA |
| ERS1972377 | -176.7614 | SAMEA104347399 | 88.4072 | ERS1972377 | 2017-11-20 | NA | NA | Station 31 sea ice, depth 0.5m | marine biome | marine water body | 0.5 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | -999.0 | water | 999.0 | NA | NA | NA |
| ERS1972391 | -89.2525 | SAMEA104347413 | 89.9903 | ERS1972391 | 2017-11-20 | NA | NA | Station 33 sea ice, depth 0.1m | marine biome | marine water body | 0.1 | samples | root:Environmental:Aquatic:Marine:Oceanic | samples | -999.0 | water | 999.0 | NA | NA | NA |
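Base R's `rbind()` requires both data frames to have the same column names. Because sample metadata columns depend on what authors submitted, two queries may not always return identical columns, in which case `rbind()` stops with an error. The sketch below shows one base-R workaround, using small made-up data frames in place of real query results. `rbind_fill` is a hypothetical helper, not an MGnifyR function.

```r
# Hypothetical helper: row-bind two data frames whose columns may differ,
# filling columns that are absent from one input with NA.
rbind_fill <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- rep(NA_character_, nrow(a))
  for (col in setdiff(all_cols, names(b))) b[[col]] <- rep(NA_character_, nrow(b))
  rbind(a[, all_cols, drop = FALSE], b[, all_cols, drop = FALSE])
}

# Made-up stand-ins for two query results (all values are strings,
# as in MGnifyR output).
north <- data.frame(accession = c("N1", "N2"), latitude = c("89.9", "88.4"),
                    depth = c("5.0", "0.1"), stringsAsFactors = FALSE)
south <- data.frame(accession = "S1", latitude = "-89.0",
                    salinity = "35", stringsAsFactors = FALSE)

polar <- rbind_fill(north, south)
print(polar)
```

When both inputs already share the same columns, the helper behaves like a plain `rbind()`.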

Example: find Wastewater studies

studies_ww <- mgnify_query(mg, "studies", biome_name="wastewater", maxhits=1)
head(studies_ww)
A data.frame: 6 × 12
| | bioproject | samples-count | accession | is-private | secondary-accession | centre-name | study-abstract | study-name | data-origination | last-update | acc_type | type |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| MGYS00005985 | PRJEB45225 | 1 | MGYS00005985 | FALSE | ERP129301 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593593, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater. | EMG produced TPA metagenomics assembly of PRJNA593593 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10). | SUBMITTED | 2022-03-11T21:49:39 | studies | studies |
| MGYS00005997 | PRJEB45727 | 1 | MGYS00005997 | FALSE | ERP129875 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593594, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater. | EMG produced TPA metagenomics assembly of PRJNA593594 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 11). | SUBMITTED | 2022-03-11T21:12:02 | studies | studies |
| MGYS00005986 | PRJNA593593 | 1 | MGYS00005986 | FALSE | SRP270050 | DOE Joint Genome Institute | Sewage-derived enrichment culture (anaerobic medium, 0.1% glucose), planktonic phase | Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10 | HARVESTED | 2022-02-28T14:04:08 | studies | studies |
| MGYS00002316 | PRJEB24109 | 1 | MGYS00002316 | FALSE | ERP105914 | EMBL-EBI | The activated sludge metagenome Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set: PRJNA340752. This project includes samples from the following biomes: Engineered, Wastewater, Activated Sludge. | EMG produced TPA metagenomics assembly of the Active sludge microbial communities of municipal wastewater-treating anaerobic digesters from China - AD_SCU002_MetaG metagenome (activated sludge metagenome) data set. | SUBMITTED | 2022-02-03T15:58:54 | studies | studies |
| MGYS00005846 | PRJEB47494 | 110 | MGYS00005846 | FALSE | ERP131768 | EMG | The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJEB27054, and was assembled with metaSPAdes v3.12.0. This project includes samples from the following biomes: root:Engineered:Wastewater:Water and sludge. | EMG produced TPA metagenomics assembly of PRJEB27054 data set (Global surveillance of antimicrobial resistance). | SUBMITTED | 2021-11-18T06:32:39 | studies | studies |
| MGYS00005847 | PRJEB27054 | 109 | MGYS00005847 | FALSE | ERP109094 | DTU-GE | Antimicrobial resistance (AMR) is one of the most serious global public health threats, however, obtaining representative data on AMR for healthy human populations is difficult. We characterized the bacterial resistome from untreated sewage from 79 sites in 60 countries. We found systematic differences in abundance and diversity of AMR genes between Europe/North-America/Oceania and Africa/Asia/South-America. Antimicrobial use data only explained a minor part of the AMR variation and no evidence for cross-selection between antimicrobial classes nor effect of travel by flight between sites were found. However, AMR abundance was strongly correlated with socio-economic, health and environmental factors, which we used to predict AMR abundances in all countries in the world. Our findings suggest that the global AMR gene diversity and abundance varies by region and are caused by national circumstances. Improving sanitation and health could potentially limit the global burden of AMR. We propose to use sewage for an ethically acceptable and economically feasible continuous global surveillance and prediction of AMR. | Global surveillance of antimicrobial resistance | SUBMITTED | 2021-11-17T00:54:21 | studies | studies |
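Every column in the result is a character column, including counts such as samples-count, so values need converting before they can be sorted or compared as numbers. The sketch below does this for three rows copied from the output above, used here as a stand-in for a live query result. The column name `n_samples` is introduced only for this illustration.

```r
# Three rows copied from the output above, standing in for a live query result.
studies <- data.frame(
  accession = c("MGYS00005985", "MGYS00005846", "MGYS00005847"),
  `samples-count` = c("1", "110", "109"),
  `data-origination` = c("SUBMITTED", "SUBMITTED", "SUBMITTED"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Convert the character counts to integers; comparisons on the raw column
# would be text comparisons rather than numeric ones.
studies$n_samples <- as.integer(studies[["samples-count"]])

# Order from the largest study to the smallest.
by_size <- studies[order(studies$n_samples, decreasing = TRUE), ]
print(by_size[, c("accession", "n_samples")])
```

Column names containing hyphens, such as samples-count, need `[[ ]]` or backticks because `studies$samples-count` would be parsed as a subtraction.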

More filters to try:

Samples by location

more_northerly_than <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)

more_southerly_than <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)

more_easterly_than <- mgnify_query(mg, "samples", longitude_gte=170, maxhits=1)

more_westerly_than <- mgnify_query(mg, "samples", longitude_lte=170, maxhits=1)

at_location <- mgnify_query(mg, "samples", geo_loc_name="usa", maxhits=1)

Samples by biome

biome_within_wastewater <- mgnify_query(mg, "samples", biome_name="wastewater", maxhits=1)

Samples by metadata

There is a large number of metadata key:value pairs, because authors submit them to the ENA along with their samples.
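Because the metadata are author-submitted, the same field can hold placeholder values as well as real measurements. In the polar samples output earlier, temperature and salinity contain -999.0 and 999.0, which look like missing-value placeholders; that reading is an assumption, not something the API states. The sketch below converts such a column to numeric and marks those values as NA. `clean_numeric` is a hypothetical helper and its default sentinel list is illustrative only.

```r
# Hypothetical helper: convert a character metadata column to numeric and
# replace assumed placeholder values with NA.
clean_numeric <- function(x, sentinels = c(-999, 999)) {
  values <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA
  values[values %in% sentinels] <- NA
  values
}

# Temperature values as they appear in the polar samples output.
temperature_raw <- c(NA, "-999.0", "-999.0", NA, "-999.0", "-999.0")
temperature <- clean_numeric(temperature_raw)
print(temperature)

# A genuine measurement is left untouched.
mixed <- clean_numeric(c("4.5", "-999.0", "not recorded"))
print(mixed)
```

Check the sentinel list against each field before using it, since a value such as 999 could be a real measurement for some keys.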

If you know how to specify the metadata key:value query for the samples you’re interested in, you can use this form to find matching Samples:

from_ex_smokers <- mgnify_query(mg, "samples", metadata_key="smoker", metadata_value="ex-smoker", maxhits=-1)

To find metadata_keys and values, it is best to browse the interactive API Browser, and use the Filters button to construct queries interactively at first.

Studies by centre name

from_smithsonian <- mgnify_query(mg, "studies", centre_name="Smithsonian", maxhits=-1)



Example: adding additional filters to the data frame

First, fetch some samples from the Lentic biome. We can specify the entire Biome lineage, too.

lentic_samples <- mgnify_query(mg, "samples", biome_name="root:Environmental:Aquatic:Lentic", usecache=TRUE)

Now, also filter by depth within the returned results, using normal R syntax.

depth_numeric <- as.numeric(lentic_samples$depth)  # We must convert data from MGnifyR (always strings) to numerical format.
depth_numeric[is.na(depth_numeric)] <- 0.0  # If depth data is missing, assume it is surface-level.
lentic_subset <- lentic_samples[depth_numeric >= 25 & depth_numeric <= 50, ]  # Filter to samples collected between 25m and 50m down.
lentic_subset
A data.frame: 16 × 37
| | longitude | latitude | biosample | accession | collection-date | sample-desc | sample-name | sample-alias | last-update | geographic location (latitude) | instrument model | last update date | investigation type | project name | geographic location (depth) | geographic location (altitude) | environmental package | sequencing method | NCBI sample classification | ENA checklist |
| | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
| SRS992699 | 40.54 | 17.39 | SAMN03860260 | SRS992699 | 2011-10-15 | 12 | sample03 | sample03 | 2020-05-18T00:52:00 | 17.39 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992702 | 38.46 | 20.31 | SAMN03860274 | SRS992702 | 2011-10-15 | 91 | sample17 | sample17 | 2020-05-18T00:51:47 | 20.31 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992693 | 40.54 | 17.39 | SAMN03860259 | SRS992693 | 2011-10-15 | 12 | sample02 | sample02 | 2020-05-18T00:50:43 | 17.39 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992705 | 37.3 | 23.36 | SAMN03860286 | SRS992705 | 2011-10-15 | 149 | sample29 | sample29 | 2020-05-18T00:46:05 | 23.36 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992692 | 40.44 | 18.34 | SAMN03860268 | SRS992692 | 2011-10-15 | 34 | sample11 | sample11 | 2020-05-18T00:45:26 | 18.34 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992710 | 37.55 | 22.2 | SAMN03860281 | SRS992710 | 2011-10-15 | 108 | sample24 | sample24 | 2020-05-18T00:35:28 | 22.2 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992714 | 36.6 | 25.46 | SAMN03860292 | SRS992714 | 2011-10-15 | 169 | sample35 | sample35 | 2020-05-18T00:35:15 | 25.46 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992704 | 37.3 | 23.36 | SAMN03860287 | SRS992704 | 2011-10-15 | 149 | sample30 | sample30 | 2020-05-18T00:27:10 | 23.36 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992713 | 36.6 | 25.46 | SAMN03860293 | SRS992713 | 2011-10-15 | 169 | sample36 | sample36 | 2020-05-18T00:13:07 | 25.46 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992696 | 37.55 | 22.2 | SAMN03860280 | SRS992696 | 2011-10-15 | 108 | sample23 | sample23 | 2020-05-18T00:06:42 | 22.2 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992729 | 39.47 | 17.59 | SAMN03860263 | SRS992729 | 2011-10-15 | 22 | sample06 | sample06 | 2020-05-17T10:16:43 | 17.59 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992701 | 38.46 | 20.31 | SAMN03860275 | SRS992701 | 2011-10-15 | 91 | sample18 | sample18 | 2020-05-17T10:11:28 | 20.31 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992691 | 40.44 | 18.34 | SAMN03860269 | SRS992691 | 2011-10-15 | 34 | sample12 | sample12 | 2020-05-17T10:01:44 | 18.34 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992721 | 39.47 | 17.59 | SAMN03860262 | SRS992721 | 2011-10-15 | 22 | sample05 | sample05 | 2020-05-17T10:01:21 | 17.59 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992720 | 34.3 | 27.53 | SAMN03860299 | SRS992720 | 2011-10-15 | 192 | sample42 | sample42 | 2020-05-17T10:00:58 | 27.53 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SRS992719 | 34.3 | 27.53 | SAMN03860298 | SRS992719 | 2011-10-15 | 192 | sample41 | sample41 | 2020-05-17T10:00:45 | 27.53 | Illumina HiSeq 2000 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
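The depth filter in this example treats a missing depth as 0 m. If you would rather exclude samples whose depth is unknown, one alternative is to leave the NA values in place and let `which()` drop them. The sketch below uses a small made-up data frame in place of a real query result.

```r
# Made-up stand-in for a query result; depth is a string, as in MGnifyR output.
samples <- data.frame(
  accession = c("S1", "S2", "S3", "S4"),
  depth = c("30", NA, "75", "25"),
  stringsAsFactors = FALSE
)

depth_m <- as.numeric(samples$depth)

# which() ignores NA comparisons, so samples without a depth are excluded
# instead of being treated as surface samples.
in_range <- which(depth_m >= 25 & depth_m <= 50)
subset_25_50 <- samples[in_range, ]
print(subset_25_50)
```

Indexing with the logical vector directly would instead return an all-NA row for each sample with a missing depth, which is why `which()` is used here.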