Search for Samples or Studies

Authors and affiliations

Sandy Rogers (MGnify team at EMBL-EBI)

Ben Allen (Newcastle University)

Search for MGnify Studies or Samples, using MGnifyR

The MGnify API returns data and relationships as JSON. MGnifyR is a package to help you read MGnify data into your R analyses.
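
As a rough illustration of the kind of JSON that MGnifyR handles for you, the sketch below parses a hand-written fragment shaped like a JSON:API response. The fragment is an assumption made for illustration: only the accession and coordinates are taken from the sample results shown later in this notebook, and a real response carries many more fields. It uses the jsonlite package, which is not otherwise loaded in this notebook.

```r
library(jsonlite)

# Hand-written, illustrative fragment; not a verbatim API response.
response_text <- '{
  "data": [
    {
      "type": "samples",
      "id": "SRS2738766",
      "attributes": {"latitude": 89.9903, "longitude": 89.2525}
    }
  ]
}'

# simplifyVector = FALSE keeps the nested list structure instead of
# collapsing it into data frames.
parsed <- fromJSON(response_text, simplifyVector = FALSE)

first_record <- parsed$data[[1]]
first_record$id                   # accession of the record
first_record$attributes$latitude  # numeric value parsed from the JSON
```

With MGnifyR you normally never touch this structure directly; mgnify_query flattens it into a data frame.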

This example shows you how to search for MGnify Studies or Samples.

You can find all of the other “API endpoints” using the Browsable API interface in your web browser. This interface also lets you inspect the kinds of Filters that can be created for each list.

This is an interactive code notebook (a Jupyter Notebook). To run this code, click into each cell and press the ▶ button in the top toolbar, or press shift+enter.

library(IRdisplay)
display_markdown(file = '../_resources/mgnifyr_help.md')

Help with MGnifyR

MGnifyR is an R package that provides a convenient way for R users to access data from the MGnify API.

Detailed help for each function is available in R using the standard ?function_name command (e.g. typing ?mgnify_query will bring up built-in help for the mgnify_query command).

A vignette is available containing a reasonably detailed overview of the main functionality. It can be read either within R with the vignette("MGnifyR") command, or in the development repository.

MGnifyR Command cheat sheet

The following list of key functions should give a starting point for finding relevant documentation.

mgnify_client() : Create the client object required for all other functions.
mgnify_query() : Search the whole MGnify database.
mgnify_analyses_from_xxx() : Convert xxx accessions to analysis accessions. xxx is either samples or studies.
mgnify_get_analyses_metadata() : Retrieve all study, sample and analysis metadata for given analyses.
mgnify_get_analyses_phyloseq() : Convert abundance, taxonomic, and sample metadata into a single phyloseq object.
mgnify_get_analyses_results() : Get functional annotation results for a set of analyses.
mgnify_download() : Download raw results files from MGnify.
mgnify_retrieve_json() : Low level API access helper function.

Load packages:

library(dplyr)
library(vegan)
library(ggplot2)
library(phyloseq)
library(MGnifyR)

mg <- mgnify_client(usecache = TRUE, cache_dir = '/tmp/mgnify_cache')


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: permute

Loading required package: lattice

This is vegan 2.6-4

Example: Find Polar Samples
Example: Find Wastewater Studies
More Sample filters
More Study filters
Example: Filtering Samples both API-side and client-side

Documentation for `mgnify_query`

?mgnify_query

Example: find Polar samples

In these examples we set maxhits=1 to retrieve only the first page of results. You can change the limit or set it to -1 to retrieve all samples matching the query.

samps_np <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)
samps_sp <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)
samps_polar <- bind_rows(samps_np, samps_sp)

head(samps_polar)

A data.frame: 6 × 54
	latitude	biosample	longitude	accession	collection-date	sample-desc	environment-biome	environment-feature	environment-material	sample-name	⋯	size fraction upper threshold	temperature	salinity	target gene	host-tax-id	altitude	host common name	host taxid	host scientific name	species
	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
SRS2738766	89.9903	SAMN08138098	89.2525	SRS2738766	2015-09-07	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_07_Station33_1.2_A_KD017	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS2738765	89.9903	SAMN08138099	89.2525	SRS2738765	2015-09-07	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_07_Station33_1_A_KD019	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS2738693	88.4072	SAMN08138091	-176.7614	SRS2738693	2015-09-04	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_04_Station31_0.2_A_KD009	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS2738695	88.4072	SAMN08138089	-176.7614	SRS2738695	2015-09-04	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_04_Station31_0.4_A_KD007	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS2738694	88.4072	SAMN08138088	-176.7614	SRS2738694	2015-09-04	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_04_Station31_0.8_A_KD006	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS2738692	88.4072	SAMN08138090	-176.7614	SRS2738692	2015-09-04	Keywords: GSC:MIxS MIMARKS:5.0	marine biome	sea ice floe	sea ice	GEOTRACES_2015_09_04_Station31_1_A_KD008	⋯	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
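
The columns returned by mgnify_query are character vectors (note the <chr> types above), so numeric comparisons need an explicit conversion. A minimal sketch, using a small hand-built data frame rather than a live query: two latitudes are copied from the table above, and the southern row is hypothetical.

```r
# Stand-in for a query result: two rows copied from the table above plus one
# hypothetical southern row (the accession "EXAMPLE_SOUTH" is made up).
samps <- data.frame(
  accession = c("SRS2738766", "SRS2738693", "EXAMPLE_SOUTH"),
  latitude  = c("89.9903", "88.4072", "-89.5"),
  stringsAsFactors = FALSE
)

# Convert the character column to numeric before comparing.
samps$latitude_num <- as.numeric(samps$latitude)

# Label each sample by hemisphere.
samps$pole <- ifelse(samps$latitude_num >= 0, "north", "south")

table(samps$pole)
```

The same conversion applies to any other numeric field, such as longitude or depth.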

Example: find Wastewater studies

studies_ww <- mgnify_query(mg, "studies", biome_name="wastewater", maxhits=1)

head(studies_ww)

A data.frame: 6 × 12
	samples-count	accession	bioproject	is-private	last-update	secondary-accession	centre-name	study-abstract	study-name	data-origination	acc_type	type
	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
MGYS00006570	85	MGYS00006570	PRJEB71375	FALSE	2024-08-15T16:20:26	ERP156179	EMG	The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA230567, and was assembled with metaspades v3.15.3. This project includes samples from the following biomes: root:Engineered:Wastewater.	EMG produced TPA metagenomics assembly of PRJNA230567 data set (Systems Biology of Lipid Accumulating Organisms).	SUBMITTED	studies	studies
MGYS00006558	89	MGYS00006558	PRJNA230567	FALSE	2023-12-19T12:35:08	SRP033648	Luxembourg Centre for Systems Biomedicine	Characterization of microbial communities at the genomic, transcriptomic, proteomic and metabolomic levels, with a special interest on lipid accumulating bacterial populations, which are naturally enriched in biological wastewater treatment systems and may be harnessed for the conversion of mixed lipid substrates (wastewater) into biodiesel. The project aims to elucidate the genetic blueprints and the functional relevance of specific populations within the community. It focuses on within-population genetic and functional heterogeneity, trying to understand how fine-scale variations contribute to differing lipid accumulating phenotypes. Insights from this project will contribute to the understanding the functioning of microbial ecosystems, and improve optimization and modeling strategies for current and future biological wastewater treatment processes. This BioProject contains datasets derived from the same biological wastewater treatment plant. The date includes metagenomes, metatranscriptomes and organisms isolated in pure cultures.	Systems Biology of Lipid Accumulating Organisms	HARVESTED	studies	studies
MGYS00005985	1	MGYS00005985	PRJEB45225	FALSE	2022-03-11T21:49:39	ERP129301	EMG	The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593593, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater.	EMG produced TPA metagenomics assembly of PRJNA593593 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10).	SUBMITTED	studies	studies
MGYS00005997	1	MGYS00005997	PRJEB45727	FALSE	2022-03-11T21:12:02	ERP129875	EMG	The Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set PRJNA593594, and was assembled with metaSPAdes v3.15.2. This project includes samples from the following biomes: root:Engineered:Wastewater.	EMG produced TPA metagenomics assembly of PRJNA593594 data set (Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 11).	SUBMITTED	studies	studies
MGYS00005986	1	MGYS00005986	PRJNA593593	FALSE	2022-02-28T14:04:08	SRP270050	DOE Joint Genome Institute	Sewage-derived enrichment culture (anaerobic medium, 0.1% glucose), planktonic phase	Sewage microbial communities from Oakland, California, United States - Biofuel Metagenome 10	HARVESTED	studies	studies
MGYS00002316	1	MGYS00002316	PRJEB24109	FALSE	2022-02-03T15:58:54	ERP105914	EMBL-EBI	The activated sludge metagenome Third Party Annotation (TPA) assembly was derived from the primary whole genome shotgun (WGS) data set: PRJNA340752. This project includes samples from the following biomes: Engineered, Wastewater, Activated Sludge.	EMG produced TPA metagenomics assembly of the Active sludge microbial communities of municipal wastewater-treating anaerobic digesters from China - AD_SCU002_MetaG metagenome (activated sludge metagenome) data set.	SUBMITTED	studies	studies
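
Several column names in these results contain hyphens (for example samples-count), so they need backticks or [[ ]] when used in R expressions. A small sketch, on a hand-built data frame that copies three accessions and sample counts from the table above, ranking studies by sample count:

```r
# check.names = FALSE keeps the hyphenated column name as returned.
studies <- data.frame(
  accession       = c("MGYS00006570", "MGYS00006558", "MGYS00005985"),
  `samples-count` = c("85", "89", "1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Values are character strings, so convert before sorting numerically.
n_samples <- as.integer(studies[["samples-count"]])

# Order studies from most to fewest samples.
studies_ranked <- studies[order(n_samples, decreasing = TRUE), ]
studies_ranked
```

Sorting the raw strings instead would place "89" before "85" but also "1" after "85" only by accident of these values, which is why the conversion comes first.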

More filters to try:

Samples by location

more_northerly_than <- mgnify_query(mg, "samples", latitude_gte=88, maxhits=1)

more_southerly_than <- mgnify_query(mg, "samples", latitude_lte=-88, maxhits=1)

more_easterly_than <- mgnify_query(mg, "samples", longitude_gte=170, maxhits=1)

more_westerly_than <- mgnify_query(mg, "samples", longitude_lte=-170, maxhits=1)

at_location <- mgnify_query(mg, "samples", geo_loc_name="usa", maxhits=1)
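
Longitude runs from -180 to 180, so samples close to the antimeridian fall into two separate ranges. A minimal sketch of combining them client-side, on a hand-built data frame: all accessions are hypothetical, and the longitudes are illustrative values in the same string format that the queries return.

```r
# Hypothetical stand-in for a query result; longitudes are strings, as returned.
samps <- data.frame(
  accession = c("EXAMPLE_A", "EXAMPLE_B", "EXAMPLE_C"),
  longitude = c("-176.7614", "175.2", "40.54"),
  stringsAsFactors = FALSE
)

lon <- as.numeric(samps$longitude)

# Within 10 degrees of the antimeridian, on either side.
near_antimeridian <- samps[abs(lon) >= 170, ]
near_antimeridian$accession
```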

Samples by biome

biome_within_wastewater <- mgnify_query(mg, "samples", biome_name="wastewater", maxhits=1)

Samples by metadata

There are a large number of metadata key:value pairs because they are supplied by the submitting authors, along with the samples, to the ENA.

If you know how to specify the metadata key:value query for the samples you’re interested in, you can use this form to find matching Samples:

from_ex_smokers <- mgnify_query(mg, "samples", metadata_key="smoker", metadata_value="ex-smoker", maxhits=-1)

To find metadata_keys and values, it is best to browse the interactive API Browser, and use the Filters button to construct queries interactively at first.
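
Because metadata keys vary between submissions, many columns in a result are mostly NA. One way to see which fields are actually populated before filtering on them, sketched on a hand-built data frame with hypothetical values:

```r
# Hypothetical stand-in for a query result with sparse metadata columns.
samps <- data.frame(
  accession   = c("EXAMPLE_1", "EXAMPLE_2", "EXAMPLE_3"),
  depth       = c("25", NA, "40"),
  temperature = c(NA, NA, "4.2"),
  stringsAsFactors = FALSE
)

# Count the non-missing values in each column, most complete first.
coverage <- sort(colSums(!is.na(samps)), decreasing = TRUE)
coverage
```

Columns with low coverage are poor candidates for client-side filtering, since most samples would be excluded for lack of data rather than for failing the condition.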

Studies by centre name

from_smithsonian <- mgnify_query(mg, "studies", centre_name="Smithsonian", maxhits=-1)

Example: adding additional filters to the data frame

First, fetch some samples from the Lentic biome. We can specify the entire Biome lineage, too.

lentic_samples <- mgnify_query(mg, "samples", biome_name="root:Environmental:Aquatic:Lentic", usecache=TRUE)

Now, also filter by depth within the returned results, using normal R syntax.

# MGnifyR returns every column as strings, so convert depth to numeric first.
depth_numeric <- as.numeric(lentic_samples$depth)
# If depth data is missing, assume it is surface-level.
depth_numeric[is.na(depth_numeric)] <- 0.0
# Filter to samples collected between 25 m and 50 m down.
lentic_subset <- lentic_samples[depth_numeric >= 25 & depth_numeric <= 50, ]
lentic_subset

A data.frame: 16 × 37
	latitude	biosample	longitude	accession	collection-date	sample-desc	sample-name	sample-alias	last-update	geographic location (longitude)	⋯	instrument model	last update date	investigation type	project name	geographic location (depth)	geographic location (altitude)	environmental package	sequencing method	NCBI sample classification	ENA checklist
	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
SRS992699	17.39	SAMN03860260	40.54	SRS992699	2011-10-15	12	sample03	sample03	2020-05-18T00:52:00	40.54	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992702	20.31	SAMN03860274	38.46	SRS992702	2011-10-15	91	sample17	sample17	2020-05-18T00:51:47	38.46	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992693	17.39	SAMN03860259	40.54	SRS992693	2011-10-15	12	sample02	sample02	2020-05-18T00:50:43	40.54	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992705	23.36	SAMN03860286	37.3	SRS992705	2011-10-15	149	sample29	sample29	2020-05-18T00:46:05	37.3	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992692	18.34	SAMN03860268	40.44	SRS992692	2011-10-15	34	sample11	sample11	2020-05-18T00:45:26	40.44	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992710	22.2	SAMN03860281	37.55	SRS992710	2011-10-15	108	sample24	sample24	2020-05-18T00:35:28	37.55	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992714	25.46	SAMN03860292	36.6	SRS992714	2011-10-15	169	sample35	sample35	2020-05-18T00:35:15	36.6	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992704	23.36	SAMN03860287	37.3	SRS992704	2011-10-15	149	sample30	sample30	2020-05-18T00:27:10	37.3	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992713	25.46	SAMN03860293	36.6	SRS992713	2011-10-15	169	sample36	sample36	2020-05-18T00:13:07	36.6	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992696	22.2	SAMN03860280	37.55	SRS992696	2011-10-15	108	sample23	sample23	2020-05-18T00:06:42	37.55	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992729	17.59	SAMN03860263	39.47	SRS992729	2011-10-15	22	sample06	sample06	2020-05-17T10:16:43	39.47	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992701	20.31	SAMN03860275	38.46	SRS992701	2011-10-15	91	sample18	sample18	2020-05-17T10:11:28	38.46	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992691	18.34	SAMN03860269	40.44	SRS992691	2011-10-15	34	sample12	sample12	2020-05-17T10:01:44	40.44	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992721	17.59	SAMN03860262	39.47	SRS992721	2011-10-15	22	sample05	sample05	2020-05-17T10:01:21	39.47	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992720	27.53	SAMN03860299	34.3	SRS992720	2011-10-15	192	sample42	sample42	2020-05-17T10:00:58	34.3	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
SRS992719	27.53	SAMN03860298	34.3	SRS992719	2011-10-15	192	sample41	sample41	2020-05-17T10:00:45	34.3	⋯	Illumina HiSeq 2000	NA	NA	NA	NA	NA	NA	NA	NA	NA
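
Treating missing depths as 0 is one choice; an alternative is to drop samples with no usable depth value. A sketch on a hand-built data frame with hypothetical values, wrapping the steps in a helper (filter_by_depth is a name introduced here for illustration, not part of MGnifyR):

```r
# Keep rows whose depth parses as a number within [min_depth, max_depth].
filter_by_depth <- function(samples, min_depth, max_depth) {
  # Non-numeric strings become NA; suppress the coercion warning.
  depth <- suppressWarnings(as.numeric(samples$depth))
  keep <- !is.na(depth) & depth >= min_depth & depth <= max_depth
  samples[keep, , drop = FALSE]
}

# Hypothetical stand-in for a query result.
samps <- data.frame(
  accession = c("EXAMPLE_1", "EXAMPLE_2", "EXAMPLE_3", "EXAMPLE_4"),
  depth     = c("30", NA, "unknown", "75"),
  stringsAsFactors = FALSE
)

filter_by_depth(samps, 25, 50)
```

Which behaviour is appropriate depends on the analysis: assuming surface level keeps more samples, while dropping them avoids attributing a depth that was never recorded.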
