Remote querying of MGnify Proteins using Parquet and DuckDB
The shift to Parquet
As of the 2026_06 release, we have started shifting towards the use of files in the Apache Parquet format for the distribution of MGnify Proteins sequence data and metadata. There are numerous advantages to this shift:
- Columnar data access: As Parquet is a columnar data format, certain query structures that are commonly used to search MGnify Proteins metadata become significantly faster, as only columns of interest need to be searched. Columnar data similarly allows for partial reading of remote files.
- Native schemas: Parquet stores metadata about each column within a file - including data types - much like a relational database. These schemas allows for a more explicit data structure, more robust data validation, and more efficient queries e.g. querying based on integer identifiers in a Parquet file vs flat-files without a schema like .tsv, the possibility of applying Bloom filters, etc.
- Native compression: Parquet is a naturally compressed file format, with data types allowing for more efficient storing of this data.
- Partial reading: Using tools like DuckDB that support partial reading on Parquet files, it is possible make remote queries without downloading the entire source file, unlike with regular flat file formats like .tsv.
Partial reading makes MGnify Proteins more accessible to users who are unable to download the increasingly massive flat files, as it is now possible to make your query over HTTPS directly by using the relevant Parquet file. In this documentation, we will go through example queries of how you can remotely query MGnify Proteins using DuckDB.
General guidelines
In the regular MGnify Proteins flat files, identifiers usually start with a prefix that changes depending on the data at hand, e.g. MGYP for proteins, MGYC for contigs. In parquet however, these prefixes have been stripped to convert their data types to numeric. This change allows for faster queries and more optimised storage.
Also, while Parquet allows for significant optimisations in query speeds, certain queries are likely to still be slow without the use of significant computational resources, as a full-file scan could still be required for the columns of interest. Example queries could be:
- Sorting queries - we have done our best to preemptively sort the parquet files using numeric identifers where sensible to speed up potentially common queries.
- Distinct queries - similar to sorting queries, if you need to remove duplicates from results, it could get computationally expensive depending on the size of the initial result.
- Summary statistic queries - calculating means/medians/etc of numeric values will require a pass of the entire data, and could end up being slow.
- Large table joins - Parquet files can be utilised in similar ways to tables and joined in queries. Depending on the complexity of the join, such queries could get computationally expensive very easily.
Parquet files available
As part of the 2026_06 release, we distribute eight Parquet files.
mgy_proteins_metadata.parquet
All metadata related to proteins, assemblies, contigs, etc, for the entire protein database, in one Parquet file.
Column descriptions:
- occurrence_id: Identifier for a protein occurrence, which is a function of protein_id and contig_id i.e. it’s an occurrence of a protein sequence on a specific contig
- study_id: Identifier for an ENA study, which corresponds to an internal counter of the study_accession. Note that this is not the same integer as the study’s MGYS accession.
- assembly_id: Identifier for an ENA assembly, which corresponds to an assembly_accession
- contig_id: Identifier for a contig within an assembly, which corresponds to a contig_name. In flat files, contig_ids are preceded by the prefix MGYC, e.g
MGYC000120576919has the contig_id120576919 - protein_id: Identifier for a non-redundant protein sequence which appears in one or more contigs (and therefore assemblies). In flat files, protein_ids are preceded by the prefix MGYP, e.g
MGYP000004750550has the protein_id4750550 - study_accession: ENA accession for a study
- assembly_accession: ENA accession for an assembly
- pipeline_version: The MGnify analysis pipeline version that ran the combined gene caller to predict proteins. Current versions are 4.1 and 5.0.
- lineage: The biome for the protein occurrence
- contig_name: The name for a specific contig in a specific assembly i.e. the header the contig had in the FASTA file for the assembly
- contig_length: The length of the contig in pair of bases
- kmer_coverage: The metaSPAdes kmer_coverage - will be null for non-metaspades-assembled data.
- truncation: The partial truncation value from Prodigal i.e. 00 is a full-length sequence, everything else is partial. Will be null for proteins coming from FragGeneScan
- start_position: The start position of the called gene/protein on the contig
- end_position: The end position of the called gene/protein on the contig
- strand: The strand of the called gene/protein on the contig i.e. 1 is the forward orientation, -1 is the reverse orientation
- gene_caller_id: Identifier for the gene_caller that was used for the called gene/protein i.e. 1 is Prodigal (v2.6.3), 2 is FragGeneScan (v1.31), and 3 is FragGeneScan (v1.20)
- cluster_rep: The cluster representative for a protein
- is_cluster_rep: A boolean for whether a protein is a cluster representative or not
| occurrence_id | study_id | assembly_id | contig_id | protein_id | study_accession | assembly_accession | pipeline_version | lineage | contig_name | contig_length | kmer_coverage | truncation | start_position | end_position | strand | gene_caller_id | cluster_rep | is_cluster_rep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6966728250 | 1916 | 68124 | 4225793659 | 5363 | ERP146353 | ERZ24775148 | 5.0 | root:Host-associated:Human:Digestive system:Large intestine:Fecal | ERZ27249607.747904-NODE-747904-length-578-cov-1.994264 | 578 | 1.994264 | 00 | 24468 | 25016 | 1 | 1 | 3381439768 | false |
mgy_protein_sequences.parquet
All non-redundant protein sequences in one Parquet file, including information about whether a sequence is full-length or not.
Column descriptions:
- protein_id: Identifier for a non-redundant protein sequence
- full_length: A boolean i.e. 1 is full-length, 0 is partial
- sequence: The amino acid sequence for the protein
| protein_id | full_length | sequence |
|---|---|---|
| 7700404008 | false | SYSTSHDERTDEDEIVDGPLNGEGQGFIMELSDTERGCIGTYYRVVTYYDRRESADDLKECWWFDDYQQGRL |
| 7699836709 | false | MPSKFNNLAHAGRGGDVLVHGVHCHDLHQPLLLQKEEPGSQSEIL |
| 7699854978 | true | MGNFDIIDSHLHVFDLNVKANFPNQNRSHEFPNQEPIVRNISQEYAKEIANKSGVKKVVFVMCYDDCPEEAQWVYDNAQKTGNGKFVIKNVFLICY |
| 7699879094 | false | EGKNLERNSKFLKVSRTLITIFIYHSYEFPSLDFYYRSIVSSIENSTLRVKHKFIVYGHTIFLIPIFHALSLIWDLHVNEQYIYIYTNVPGIIEIIHKLINDAFYFDSCWLILFLMFVQSGITRGIAPNMTWNH |
| 7700267793 | true | MTNFIFTNRFGSPHNPAAINRAIKIIVDAHNADEEVAAKKEKREPVIIPKFSCHIFRHTFASRFCENETNIKVIQEVMGHADVSTTMNIYAEVNDEVTKTAIENLAKNMDVF |
mgy_cluster_seqs.parquet
A list of cluster representatives with their cluster member sequences.
Column descriptions:
- cluster_rep_accession: The protein_id for the cluster representative
- member_accessions: The protein_id of a cluster member
| cluster_rep | cluster_member |
|---|---|
| 12 | 12 |
| 32 | 32 |
| 277 | 277 |
| 277 | 954756431 |
| 277 | 1079143076 |
mgy_clusters.parquet
Statistics and biome information for clusters.
Column descriptions:
- cluster_rep_accession: The protein_id of the cluster representative
- cluster_size: The number of sequences in the cluster
- assembly_count: The number of assemblies contributing to the cluster
- cluster_rep_biomes: A non-redundant list of biomes for the cluster representative
- cluster_biomes: A non-redundant list of biomes across the whole cluster
| cluster_rep | cluster_size | cluster_assembly_count | cluster_rep_biomes | cluster_members_biomes |
|---|---|---|---|---|
| 166 | 1 | 1 | root:Host-associated:Human:Digestive system:Oral:Saliva | root:Host-associated:Human:Digestive system:Oral:Saliva |
| 353 | 3 | 12 | root:Host-associated:Human:Digestive system:Large intestine:Fecal | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
| 354 | 1 | 2 | root:Host-associated:Human:Digestive system:Large intestine:Fecal | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
| 491 | 2 | 2 | root:Host-associated:Mammals:Digestive system:Fecal | root:Host-associated:Human:Digestive system;root:Host-associated:Mammals:Digestive system:Fecal |
| 528 | 1 | 2 | root:Host-associated:Human:Digestive system:Large intestine:Fecal | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
mgy_biomes.parquet
Biomes associated with each sequence. Sequences identified in multiple biomes will have multiple rows (one per biome) in the table.
Column descriptions:
- protein_id: Identifier for a non-redundant protein sequence
- lineage_count: The number of times a particular protein has appeared in a particular lineage
- lineage: The biome for the protein
| protein_id | lineage_count | lineage |
|---|---|---|
| 1 | 2 | root:Engineered:Bioreactor |
| 1 | 13 | root:Host-associated:Human:Digestive system:Large intestine |
| 1 | 155 | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
| 2 | 1 | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
| 3 | 1 | root:Host-associated:Human:Digestive system |
mgy_biome_counts.parquet
Observation counts for each biome.
Column descriptions:
- mgyp_count: The number of observations for the biome
- lineage_prefix: The biome for the protein
| mgyp_count | lineage_prefix |
|---|---|
| 6047984260 | root |
| 330457823 | root:Engineered |
| 14332439 | root:Engineered:Biogas plant |
| 3415679 | root:Engineered:Biogas plant:Wet fermentation |
| 53184879 | root:Engineered:Bioreactor |
mgy_proteins_pfam.parquet
Pfam annotations generated by HMMER for all MGYPs that result in an annotation.
Column descriptions:
- protein_id: Identifier for a non-redundant protein sequence
- pfam_accession: The matched Pfam-A accession
- i_evalue: The independent e-value for the match
- score: The bitscore for the match
- hmm_from: The start of the match with respect to the HMM for the Pfam hit
- hmm_to: The end of the match with respect to the HMM for the Pfam hit
- env_from: The start of the match with respect to the envelope for the Pfam hit
- env_to: The start of the match with respect to the envelope for the Pfam hit
| protein_id | pfam_accession | i_evalue | score | hmm_from | hmm_to | env_from | env_to |
|---|---|---|---|---|---|---|---|
| 1 | 2566 | 5.812243921018791e-19 | 76.21484375 | 2 | 100 | 29 | 128 |
| 5 | 8956 | 1.0264838383714224e-30 | 113.23670196533203 | 2 | 59 | 2 | 60 |
| 6 | 7733 | 1.880480774071856e-10 | 48.71669006347656 | 198 | 260 | 1025 | 1102 |
| 6 | 7733 | 1.0008882501161678e-40 | 147.94119262695312 | 3 | 152 | 832 | 1038 |
| 6 | 17657 | 2.12528020408448e-25 | 96.80313110351562 | 1 | 147 | 1105 | 1256 |
mgy_counts.parquet
Observation counts for each MGYP.
Column descriptions:
- mgyp: The MGYP accession for a non-redundant protein sequence
- mgyp_count: The number of observations for the protein
| mgyp | mgyp_count |
|---|---|
| 1 | 281 |
| 2 | 1 |
| 3 | 10 |
| 5 | 5236 |
| 6 | 3 |
Example DuckDB queries
We use DuckDB for querying the Parquet files in the examples of this section, as it is fast, simple to use, and has APIs for popular languages like Python. However, there are other methods like Polars that can be used to interact with Parquet files. Please see the DuckDB website and documentation for more information about how to install and use DuckDB.
When querying the MGnify Proteins Parquet files remotely, the speed will be heavily impacted by the network and HTTPS, especially for results made up of thousands of rows. To truly take advantage of the Parquet files, we highly recommend downloading them either locally, or to a webserver that allows you to query them without being impeded by network issues.
Remote URL
The examples use a common URL to access the EMBL-EBI FTP that hosts the MGnify Proteins parquet files. This URL is: https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06.
Get MGYP sequences
One MGYP
If you have a particular MGYP accession, let’s say MGYP000000000001, and you want its amino acid sequence, you can do so with the following query:
SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_protein_sequences.parquet'
WHERE protein_id = 1;Output:
| protein_id | full_length | sequence |
|---|---|---|
| 1 | true | MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA |
Multiple MGYPs
SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_protein_sequences.parquet'
WHERE protein_id IN (1,2,3);Output:
| protein_id | full_length | sequence |
|---|---|---|
| 1 | true | MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA |
| 2 | false | MNSLKMIELSLTDVVIRMALAVVFGALIGLERSIKRKGFGISSNAILCLASCTISILQIQSVDILVDVVKQNSALASIISMDITRYGAQVISGVGFLGAGIIVFRERKVSGLTTAVMMWNVTIIGLVIGMGFLT |
| 3 | true | MKKVVLTIALFSTFVISAKATTVYAEIADDTHGTSIAKVGLEKQDETDEPTDPIDPPDVPGTYIPTGNTGALRIDYISNIDFGTQKIASETKNYTAGNSDEFVETQISDLRGNGAGWNLQVSYDSEKAGFYTENGVALAGAELSLPAGTAKTVTENQSPAAETATVTVNKDAQNIMFAAATTGLGTWEDQMSAEAVSLKVPSGNLAGTYSATLVWTLTDAPT |
Get an assembly’s proteins
If you have an ENA assembly accession, like ERZ23872034, you can find all of its MGYPs like this:
SELECT protein_id
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_proteins_metadata.parquet'
WHERE assembly_accession = 'ERZ23872034'
LIMIT 500; -- This will limit to 500 results, remove if you want all resultsOutput:
| protein_id |
|---|
| 1 |
| 17 |
| 19 |
| 24 |
| 33 |
| 53 |
| 61 |
| 64 |
| 80 |
| 86 |
| 100 |
| 105 |
Get all biomes and biome counts for a protein
If you want to get all the biomes MGYP000000000001 is in, along with how often it appears in each biome, you can do so with the following query:
SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_biomes.parquet'
WHERE protein_id = 1;Output:
| protein_id | lineage_count | lineage |
|---|---|---|
| 1 | 2 | root:Engineered:Bioreactor |
| 1 | 5 | root:Engineered:Bioremediation:Terephthalate:Wastewater |
| 1 | 6 | root:Engineered:Built environment |
| 1 | 3 | root:Engineered:Wastewater |
| 1 | 6 | root:Host-associated:Fish:Digestive system |
| 1 | 53 | root:Host-associated:Human:Digestive system |
| 1 | 3 | root:Host-associated:Human:Digestive system:Hindgut:Rectum |
| 1 | 13 | root:Host-associated:Human:Digestive system:Large intestine |
| 1 | 155 | root:Host-associated:Human:Digestive system:Large intestine:Fecal |
| 1 | 7 | root:Host-associated:Human:Digestive system:Oral |
| 1 | 3 | root:Host-associated:Human:Skin |
| 1 | 15 | root:Host-associated:Insecta:Digestive system |
| 1 | 2 | root:Host-associated:Mammals:Digestive system:Fecal |
| 1 | 1 | root:Host-associated:Mammals:Gastrointestinal tract:Intestine |
| 1 | 1 | root:Host-associated:Plants:Phylloplane |
| 1 | 6 | root:Mixed |
Get cluster representative information
Get all members of a representative
If you have a MGYP that you know is a cluster representative, and you want to fetch all of its members, you can do this:
SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_cluster_seqs.parquet'
WHERE cluster_rep = 21;Output:
| cluster_rep | cluster_member |
|---|---|
| 21 | 24867089 |
| 21 | 8799920023 |
| 21 | 9254161078 |
| 21 | 10470715852 |
| 21 | 526152225 |
| 21 | 342646082 |
| 21 | 21 |
| 21 | 7354554132 |
| 21 | 510538179 |
| 21 | 1205042829 |
| 21 | 461035109 |
| 21 | 1058480184 |
| 21 | 8315657225 |
| 21 | 84453492 |
| 21 | 8763227604 |
| 21 | 10629105130 |
| 21 | 9753012727 |
| 21 | 414100226 |
| 21 | 8096648057 |
| 21 | 1136948538 |
| 21 | 8164202599 |
| 21 | 10044486010 |
| 21 | 8998638882 |
Citation
@online{2026,
author = {, MGnify},
title = {Remote Querying of {MGnify} {Proteins} Using {Parquet} and
{DuckDB}},
date = {2026-07-03},
url = {https://docs.mgnify.org/src/docs/mgnify-proteins-parquet-queries.html},
langid = {en}
}