Remote querying of MGnify Proteins using Parquet and DuckDB

Description of the MGnify Proteins and related services

Author

Affiliation

Published

July 13, 2026

The shift to Parquet

As of the 2026_07 release, we have started shifting towards the use of files in the Apache Parquet format for the distribution of MGnify Proteins sequence data and metadata. There are numerous advantages to this shift:

Columnar data access: As Parquet is a columnar data format, certain query structures that are commonly used to search MGnify Proteins metadata become significantly faster, as only columns of interest need to be searched. Columnar data similarly allows for partial reading of remote files.
Native schemas: Parquet stores metadata about each column within a file - including data types - much like a relational database. These schemas allows for a more explicit data structure, more robust data validation, and more efficient queries e.g. querying based on integer identifiers in a Parquet file vs flat-files without a schema like .tsv, the possibility of applying Bloom filters, etc.
Native compression: Parquet is a naturally compressed file format, with data types allowing for more efficient storing of this data.
Partial reading: Using tools like DuckDB that support partial reading on Parquet files, it is possible make remote queries without downloading the entire source file, unlike with regular flat file formats like .tsv.

Partial reading makes MGnify Proteins more accessible to users who are unable to download the increasingly massive flat files, as it is now possible to make your query over HTTPS directly by using the relevant Parquet file. In this documentation, we will go through example queries of how you can remotely query MGnify Proteins using DuckDB.

General guidelines

In the regular MGnify Proteins flat files, identifiers usually start with a prefix that changes depending on the data at hand, e.g. MGYP for proteins, MGYC for contigs. In parquet however, these prefixes have been stripped to convert their data types to numeric. This change allows for faster queries and more optimised storage.

Also, while Parquet allows for significant optimisations in query speeds, certain queries are likely to still be slow without the use of significant computational resources, as a full-file scan could still be required for the columns of interest. Example queries could be:

Sorting queries - we have done our best to preemptively sort the parquet files using numeric identifers where sensible to speed up potentially common queries.
Distinct queries - similar to sorting queries, if you need to remove duplicates from results, it could get computationally expensive depending on the size of the initial result.
Summary statistic queries - calculating means/medians/etc of numeric values will require a pass of the entire data, and could end up being slow.
Large table joins - Parquet files can be utilised in similar ways to tables and joined in queries. Depending on the complexity of the join, such queries could get computationally expensive very easily.

Parquet files available

As part of the 2026_07 release, we distribute eight Parquet files.

mgy_proteins_metadata.parquet

All metadata related to proteins, assemblies, contigs, etc, for the entire protein database, in one Parquet file.

Column descriptions:

occurrence_id: Identifier for a protein occurrence, which is a function of protein_id and contig_id i.e. it’s an occurrence of a protein sequence on a specific contig
study_id: Identifier for an ENA study, which corresponds to an internal counter of the study_accession. Note that this is not the same integer as the study’s MGYS accession.
assembly_id: Identifier for an ENA assembly, which corresponds to an assembly_accession
contig_id: Identifier for a contig within an assembly, which corresponds to a contig_name. In flat files, contig_ids are preceded by the prefix MGYC, e.g MGYC000120576919 has the contig_id 120576919
protein_id: Identifier for a non-redundant protein sequence which appears in one or more contigs (and therefore assemblies). In flat files, protein_ids are preceded by the prefix MGYP, e.g MGYP000004750550 has the protein_id 4750550
study_accession: ENA accession for a study
assembly_accession: ENA accession for an assembly
pipeline_version: The MGnify analysis pipeline version that ran the combined gene caller to predict proteins. Current versions are 4.1 and 5.0.
biome: The biome for the protein occurrence
contig_name: The name for a specific contig in a specific assembly i.e. the header the contig had in the FASTA file for the assembly
contig_length: The length of the contig in pair of bases
kmer_coverage: The metaSPAdes kmer_coverage - will be null for non-metaspades-assembled data.
truncation: The partial truncation value from Prodigal i.e. 00 is a full-length sequence, everything else is partial. Will be null for proteins coming from FragGeneScan
start_position: The start position of the called gene/protein on the contig
end_position: The end position of the called gene/protein on the contig
strand: The strand of the called gene/protein on the contig i.e. 1 is the forward orientation, -1 is the reverse orientation
gene_caller_id: Identifier for the gene_caller that was used for the called gene/protein i.e. 1 is Prodigal (v2.6.3), 2 is FragGeneScan (v1.31), and 3 is FragGeneScan (v1.20)
cluster_rep: The cluster representative for a protein
is_cluster_rep: A boolean for whether a protein is a cluster representative or not

occurrence_id	study_id	assembly_id	contig_id	protein_id	study_accession	assembly_accession	contig_name	contig_length	kmer_coverage	start_position	end_position	strand	biome	pipeline_version	truncation	gene_caller_id	cluster_rep	is_cluster_rep
6966728250	1775	59597	2860239544	9680590378	ERP145406	ERZ17082817	ERZ17082817.233518-NODE-783534-length-1148-cov-5.0000	1148	5.0	3	1148	-1	root:Environmental:Aquatic:Marine	5.0	11	1	9894367754	false

mgy_protein_sequences.parquet

All non-redundant protein sequences in one Parquet file, including information about whether a sequence is full-length or not.

Column descriptions:

protein_id: Identifier for a non-redundant protein sequence
full_length: A boolean i.e. 1 is full-length, 0 is partial
sequence: The amino acid sequence for the protein

protein_id	full_length	sequence
7700404008	false	SYSTSHDERTDEDEIVDGPLNGEGQGFIMELSDTERGCIGTYYRVVTYYDRRESADDLKECWWFDDYQQGRL
7699836709	false	MPSKFNNLAHAGRGGDVLVHGVHCHDLHQPLLLQKEEPGSQSEIL
7699854978	true	MGNFDIIDSHLHVFDLNVKANFPNQNRSHEFPNQEPIVRNISQEYAKEIANKSGVKKVVFVMCYDDCPEEAQWVYDNAQKTGNGKFVIKNVFLICY
7699879094	false	EGKNLERNSKFLKVSRTLITIFIYHSYEFPSLDFYYRSIVSSIENSTLRVKHKFIVYGHTIFLIPIFHALSLIWDLHVNEQYIYIYTNVPGIIEIIHKLINDAFYFDSCWLILFLMFVQSGITRGIAPNMTWNH
7700267793	true	MTNFIFTNRFGSPHNPAAINRAIKIIVDAHNADEEVAAKKEKREPVIIPKFSCHIFRHTFASRFCENETNIKVIQEVMGHADVSTTMNIYAEVNDEVTKTAIENLAKNMDVF

mgy_cluster_seqs.parquet

A list of cluster representatives with their cluster member sequences.

Column descriptions:

cluster_rep_accession: The protein_id for the cluster representative
member_accessions: The protein_id of a cluster member

cluster_rep	cluster_member
12	12
32	32
277	277
277	954756431
277	1079143076

mgy_clusters.parquet

Statistics and biome information for clusters.

Column descriptions:

cluster_rep_accession: The protein_id of the cluster representative
cluster_size: The number of sequences in the cluster
assembly_count: The number of assemblies contributing to the cluster
cluster_rep_biomes: A non-redundant list of biomes for the cluster representative
cluster_biomes: A non-redundant list of biomes across the whole cluster

cluster_rep	cluster_size	cluster_assembly_count	cluster_rep_biomes	cluster_members_biomes
166	1	1	root:Host-associated:Human:Digestive system:Oral:Saliva	root:Host-associated:Human:Digestive system:Oral:Saliva
353	3	12	root:Host-associated:Human:Digestive system:Large intestine:Fecal	root:Host-associated:Human:Digestive system:Large intestine:Fecal
354	1	2	root:Host-associated:Human:Digestive system:Large intestine:Fecal	root:Host-associated:Human:Digestive system:Large intestine:Fecal
491	2	2	root:Host-associated:Mammals:Digestive system:Fecal	root:Host-associated:Human:Digestive system;root:Host-associated:Mammals:Digestive system:Fecal
528	1	2	root:Host-associated:Human:Digestive system:Large intestine:Fecal	root:Host-associated:Human:Digestive system:Large intestine:Fecal

mgy_biomes.parquet

Biomes associated with each sequence. Sequences identified in multiple biomes will have multiple rows (one per biome) in the table.

Column descriptions:

protein_id: Identifier for a non-redundant protein sequence
lineage_count: The number of times a particular protein has appeared in a particular lineage
biome: The biome for the protein

protein_id	lineage_count	biome
1	2	root:Engineered:Bioreactor
1	5	root:Engineered:Bioremediation:Terephthalate:Wastewater
1	6	root:Engineered:Built environment
1	3	root:Engineered:Wastewater
1	6	root:Host-associated:Fish:Digestive system

mgy_biome_counts.parquet

Observation counts for each biome.

Column descriptions:

mgyp_count: The number of observations for the biome
lineage_prefix: The biome for the protein

mgyp_count	lineage_prefix
6048290337	root
330559592	root:Engineered
14332511	root:Engineered:Biogas plant
3415718	root:Engineered:Biogas plant:Wet fermentation
53185035	root:Engineered:Bioreactor

mgy_proteins_pfam.parquet

Pfam annotations generated by HMMER for all MGYPs that result in an annotation.

Column descriptions:

protein_id: Identifier for a non-redundant protein sequence
pfam_accession: The matched Pfam-A accession
i_evalue: The independent e-value for the match
score: The bitscore for the match
hmm_from: The start of the match with respect to the HMM for the Pfam hit
hmm_to: The end of the match with respect to the HMM for the Pfam hit
env_from: The start of the match with respect to the envelope for the Pfam hit
env_to: The start of the match with respect to the envelope for the Pfam hit

protein_id	pfam_accession	i_evalue	score	hmm_from	hmm_to	env_from	env_to
1	2566	5.812243921018791e-19	76.21484375	2	100	29	128
5	8956	1.0264838383714224e-30	113.23670196533203	2	59	2	60
6	7733	1.880480774071856e-10	48.71669006347656	198	260	1025	1102
6	7733	1.0008882501161678e-40	147.94119262695312	3	152	832	1038
6	17657	2.12528020408448e-25	96.80313110351562	1	147	1105	1256

mgy_counts.parquet

Observation counts for each MGYP.

Column descriptions:

mgyp: The MGYP accession for a non-redundant protein sequence
mgyp_count: The number of observations for the protein

mgyp	mgyp_count
1	282
2	1
3	10
5	5261
6	3

Example DuckDB queries

We use DuckDB for querying the Parquet files in the examples of this section, as it is fast, simple to use, and has APIs for popular languages like Python. However, there are other methods like Polars that can be used to interact with Parquet files. Please see the DuckDB website and documentation for more information about how to install and use DuckDB.

Warning

When querying the MGnify Proteins Parquet files remotely, the speed will be heavily impacted by the network and HTTPS, especially for results made up of thousands of rows. To truly take advantage of the Parquet files, we highly recommend downloading them either locally, or to a webserver that allows you to query them without being impeded by network issues.

Remote URL

The examples use a common URL to access the EMBL-EBI FTP that hosts the MGnify Proteins parquet files. This URL is: https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07.

Get MGYP sequences

One MGYP

If you have a particular MGYP accession, let’s say MGYP000000000001, and you want its amino acid sequence, you can do so with the following query:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07/mgy_protein_sequences.parquet'
WHERE protein_id = 1;

Output:

protein_id	full_length	sequence
1	true	MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA

Multiple MGYPs

SELECT * 
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07/mgy_protein_sequences.parquet'
WHERE protein_id IN (1,2,3);

Output:

protein_id	full_length	sequence
1	true	MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA
2	false	MNSLKMIELSLTDVVIRMALAVVFGALIGLERSIKRKGFGISSNAILCLASCTISILQIQSVDILVDVVKQNSALASIISMDITRYGAQVISGVGFLGAGIIVFRERKVSGLTTAVMMWNVTIIGLVIGMGFLT
3	true	MKKVVLTIALFSTFVISAKATTVYAEIADDTHGTSIAKVGLEKQDETDEPTDPIDPPDVPGTYIPTGNTGALRIDYISNIDFGTQKIASETKNYTAGNSDEFVETQISDLRGNGAGWNLQVSYDSEKAGFYTENGVALAGAELSLPAGTAKTVTENQSPAAETATVTVNKDAQNIMFAAATTGLGTWEDQMSAEAVSLKVPSGNLAGTYSATLVWTLTDAPT

Get an assembly’s proteins

If you have an ENA assembly accession, like ERZ23872034, you can find all of its MGYPs like this:

SELECT protein_id
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07/mgy_proteins_metadata.parquet'
WHERE assembly_accession = 'ERZ1022766'
LIMIT 500; -- This will limit to 500 results, remove if you want all results

Output:

protein_id
7099172386
7099172390
7099172757
7099173182
7099173275
7099173432
7099173507
7099173723
7099173738
7099173756
7099173897
7099174147

Get all biomes and biome counts for a protein

If you want to get all the biomes MGYP000000000001 is in, along with how often it appears in each biome, you can do so with the following query:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07/mgy_biomes.parquet'
WHERE protein_id = 1;

Output:

protein_id	lineage_count	biome
1	2	root:Engineered:Bioreactor
1	5	root:Engineered:Bioremediation:Terephthalate:Wastewater
1	6	root:Engineered:Built environment
1	3	root:Engineered:Wastewater
1	6	root:Host-associated:Fish:Digestive system
1	58	root:Host-associated:Human:Digestive system
1	3	root:Host-associated:Human:Digestive system:Hindgut:Rectum
1	32	root:Host-associated:Human:Digestive system:Large intestine
1	132	root:Host-associated:Human:Digestive system:Large intestine:Fecal
1	7	root:Host-associated:Human:Digestive system:Oral
1	3	root:Host-associated:Human:Skin
1	15	root:Host-associated:Insecta:Digestive system
1	2	root:Host-associated:Mammals:Digestive system:Fecal
1	1	root:Host-associated:Mammals:Gastrointestinal tract:Intestine
1	1	root:Host-associated:Plants:Phylloplane
1	6	root:Mixed

Get cluster representative information

Get all members of a representative

If you have a MGYP that you know is a cluster representative, and you want to fetch all of its members, you can do this:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_07/mgy_cluster_seqs.parquet'
WHERE cluster_rep = 21;

Output:

cluster_rep	cluster_member
21	21
21	24867089
21	84453492
21	342646082
21	414100226
21	461035109
21	510538179
21	526152225
21	1058480184
21	1136948538
21	1205042829
21	7354554132
21	8096648057
21	8164202599
21	8315657225
21	8763227604
21	8799920023
21	8998638882
21	9254161078
21	9753012727
21	10044486010
21	10470715852
21	10629105130

Citation

BibTeX citation:

@online{2026,
  author = {, MGnify},
  title = {Remote Querying of {MGnify} {Proteins} Using {Parquet} and
    {DuckDB}},
  date = {2026-07-13},
  url = {https://docs.mgnify.org/src/docs/mgnify-proteins-parquet-queries.html},
  langid = {en}
}

For attribution, please cite this work as:

MGnify. 2026. “Remote Querying of MGnify Proteins Using Parquet and DuckDB.” July 13. https://docs.mgnify.org/src/docs/mgnify-proteins-parquet-queries.html.