Remote querying of MGnify Proteins using Parquet and DuckDB

Description of the MGnify Proteins and related services
Author
Affiliation
Published

July 3, 2026

The shift to Parquet

As of the 2026_06 release, we have started shifting towards the use of files in the Apache Parquet format for the distribution of MGnify Proteins sequence data and metadata. There are numerous advantages to this shift:

  • Columnar data access: As Parquet is a columnar data format, certain query structures that are commonly used to search MGnify Proteins metadata become significantly faster, as only columns of interest need to be searched. Columnar data similarly allows for partial reading of remote files.
  • Native schemas: Parquet stores metadata about each column within a file - including data types - much like a relational database. These schemas allows for a more explicit data structure, more robust data validation, and more efficient queries e.g. querying based on integer identifiers in a Parquet file vs flat-files without a schema like .tsv, the possibility of applying Bloom filters, etc.
  • Native compression: Parquet is a naturally compressed file format, with data types allowing for more efficient storing of this data.
  • Partial reading: Using tools like DuckDB that support partial reading on Parquet files, it is possible make remote queries without downloading the entire source file, unlike with regular flat file formats like .tsv.

Partial reading makes MGnify Proteins more accessible to users who are unable to download the increasingly massive flat files, as it is now possible to make your query over HTTPS directly by using the relevant Parquet file. In this documentation, we will go through example queries of how you can remotely query MGnify Proteins using DuckDB.

General guidelines

In the regular MGnify Proteins flat files, identifiers usually start with a prefix that changes depending on the data at hand, e.g. MGYP for proteins, MGYC for contigs. In parquet however, these prefixes have been stripped to convert their data types to numeric. This change allows for faster queries and more optimised storage.

Also, while Parquet allows for significant optimisations in query speeds, certain queries are likely to still be slow without the use of significant computational resources, as a full-file scan could still be required for the columns of interest. Example queries could be:

  • Sorting queries - we have done our best to preemptively sort the parquet files using numeric identifers where sensible to speed up potentially common queries.
  • Distinct queries - similar to sorting queries, if you need to remove duplicates from results, it could get computationally expensive depending on the size of the initial result.
  • Summary statistic queries - calculating means/medians/etc of numeric values will require a pass of the entire data, and could end up being slow.
  • Large table joins - Parquet files can be utilised in similar ways to tables and joined in queries. Depending on the complexity of the join, such queries could get computationally expensive very easily.

Parquet files available

As part of the 2026_06 release, we distribute eight Parquet files.

mgy_proteins_metadata.parquet

All metadata related to proteins, assemblies, contigs, etc, for the entire protein database, in one Parquet file.

Column descriptions:

  • occurrence_id: Identifier for a protein occurrence, which is a function of protein_id and contig_id i.e. it’s an occurrence of a protein sequence on a specific contig
  • study_id: Identifier for an ENA study, which corresponds to an internal counter of the study_accession. Note that this is not the same integer as the study’s MGYS accession.
  • assembly_id: Identifier for an ENA assembly, which corresponds to an assembly_accession
  • contig_id: Identifier for a contig within an assembly, which corresponds to a contig_name. In flat files, contig_ids are preceded by the prefix MGYC, e.g MGYC000120576919 has the contig_id 120576919
  • protein_id: Identifier for a non-redundant protein sequence which appears in one or more contigs (and therefore assemblies). In flat files, protein_ids are preceded by the prefix MGYP, e.g MGYP000004750550 has the protein_id 4750550
  • study_accession: ENA accession for a study
  • assembly_accession: ENA accession for an assembly
  • pipeline_version: The MGnify analysis pipeline version that ran the combined gene caller to predict proteins. Current versions are 4.1 and 5.0.
  • lineage: The biome for the protein occurrence
  • contig_name: The name for a specific contig in a specific assembly i.e. the header the contig had in the FASTA file for the assembly
  • contig_length: The length of the contig in pair of bases
  • kmer_coverage: The metaSPAdes kmer_coverage - will be null for non-metaspades-assembled data.
  • truncation: The partial truncation value from Prodigal i.e. 00 is a full-length sequence, everything else is partial. Will be null for proteins coming from FragGeneScan
  • start_position: The start position of the called gene/protein on the contig
  • end_position: The end position of the called gene/protein on the contig
  • strand: The strand of the called gene/protein on the contig i.e. 1 is the forward orientation, -1 is the reverse orientation
  • gene_caller_id: Identifier for the gene_caller that was used for the called gene/protein i.e. 1 is Prodigal (v2.6.3), 2 is FragGeneScan (v1.31), and 3 is FragGeneScan (v1.20)
  • cluster_rep: The cluster representative for a protein
  • is_cluster_rep: A boolean for whether a protein is a cluster representative or not
occurrence_id study_id assembly_id contig_id protein_id study_accession assembly_accession pipeline_version lineage contig_name contig_length kmer_coverage truncation start_position end_position strand gene_caller_id cluster_rep is_cluster_rep
6966728250 1916 68124 4225793659 5363 ERP146353 ERZ24775148 5.0 root:Host-associated:Human:Digestive system:Large intestine:Fecal ERZ27249607.747904-NODE-747904-length-578-cov-1.994264 578 1.994264 00 24468 25016 1 1 3381439768 false

mgy_protein_sequences.parquet

All non-redundant protein sequences in one Parquet file, including information about whether a sequence is full-length or not.

Column descriptions:

  • protein_id: Identifier for a non-redundant protein sequence
  • full_length: A boolean i.e. 1 is full-length, 0 is partial
  • sequence: The amino acid sequence for the protein
protein_id full_length sequence
7700404008 false SYSTSHDERTDEDEIVDGPLNGEGQGFIMELSDTERGCIGTYYRVVTYYDRRESADDLKECWWFDDYQQGRL
7699836709 false MPSKFNNLAHAGRGGDVLVHGVHCHDLHQPLLLQKEEPGSQSEIL
7699854978 true MGNFDIIDSHLHVFDLNVKANFPNQNRSHEFPNQEPIVRNISQEYAKEIANKSGVKKVVFVMCYDDCPEEAQWVYDNAQKTGNGKFVIKNVFLICY
7699879094 false EGKNLERNSKFLKVSRTLITIFIYHSYEFPSLDFYYRSIVSSIENSTLRVKHKFIVYGHTIFLIPIFHALSLIWDLHVNEQYIYIYTNVPGIIEIIHKLINDAFYFDSCWLILFLMFVQSGITRGIAPNMTWNH
7700267793 true MTNFIFTNRFGSPHNPAAINRAIKIIVDAHNADEEVAAKKEKREPVIIPKFSCHIFRHTFASRFCENETNIKVIQEVMGHADVSTTMNIYAEVNDEVTKTAIENLAKNMDVF

mgy_cluster_seqs.parquet

A list of cluster representatives with their cluster member sequences.

Column descriptions:

  • cluster_rep_accession: The protein_id for the cluster representative
  • member_accessions: The protein_id of a cluster member
cluster_rep cluster_member
12 12
32 32
277 277
277 954756431
277 1079143076

mgy_clusters.parquet

Statistics and biome information for clusters.

Column descriptions:

  • cluster_rep_accession: The protein_id of the cluster representative
  • cluster_size: The number of sequences in the cluster
  • assembly_count: The number of assemblies contributing to the cluster
  • cluster_rep_biomes: A non-redundant list of biomes for the cluster representative
  • cluster_biomes: A non-redundant list of biomes across the whole cluster
cluster_rep cluster_size cluster_assembly_count cluster_rep_biomes cluster_members_biomes
166 1 1 root:Host-associated:Human:Digestive system:Oral:Saliva root:Host-associated:Human:Digestive system:Oral:Saliva
353 3 12 root:Host-associated:Human:Digestive system:Large intestine:Fecal root:Host-associated:Human:Digestive system:Large intestine:Fecal
354 1 2 root:Host-associated:Human:Digestive system:Large intestine:Fecal root:Host-associated:Human:Digestive system:Large intestine:Fecal
491 2 2 root:Host-associated:Mammals:Digestive system:Fecal root:Host-associated:Human:Digestive system;root:Host-associated:Mammals:Digestive system:Fecal
528 1 2 root:Host-associated:Human:Digestive system:Large intestine:Fecal root:Host-associated:Human:Digestive system:Large intestine:Fecal

mgy_biomes.parquet

Biomes associated with each sequence. Sequences identified in multiple biomes will have multiple rows (one per biome) in the table.

Column descriptions:

  • protein_id: Identifier for a non-redundant protein sequence
  • lineage_count: The number of times a particular protein has appeared in a particular lineage
  • lineage: The biome for the protein
protein_id lineage_count lineage
1 2 root:Engineered:Bioreactor
1 13 root:Host-associated:Human:Digestive system:Large intestine
1 155 root:Host-associated:Human:Digestive system:Large intestine:Fecal
2 1 root:Host-associated:Human:Digestive system:Large intestine:Fecal
3 1 root:Host-associated:Human:Digestive system

mgy_biome_counts.parquet

Observation counts for each biome.

Column descriptions:

  • mgyp_count: The number of observations for the biome
  • lineage_prefix: The biome for the protein
mgyp_count lineage_prefix
6047984260 root
330457823 root:Engineered
14332439 root:Engineered:Biogas plant
3415679 root:Engineered:Biogas plant:Wet fermentation
53184879 root:Engineered:Bioreactor

mgy_proteins_pfam.parquet

Pfam annotations generated by HMMER for all MGYPs that result in an annotation.

Column descriptions:

  • protein_id: Identifier for a non-redundant protein sequence
  • pfam_accession: The matched Pfam-A accession
  • i_evalue: The independent e-value for the match
  • score: The bitscore for the match
  • hmm_from: The start of the match with respect to the HMM for the Pfam hit
  • hmm_to: The end of the match with respect to the HMM for the Pfam hit
  • env_from: The start of the match with respect to the envelope for the Pfam hit
  • env_to: The start of the match with respect to the envelope for the Pfam hit
protein_id pfam_accession i_evalue score hmm_from hmm_to env_from env_to
1 2566 5.812243921018791e-19 76.21484375 2 100 29 128
5 8956 1.0264838383714224e-30 113.23670196533203 2 59 2 60
6 7733 1.880480774071856e-10 48.71669006347656 198 260 1025 1102
6 7733 1.0008882501161678e-40 147.94119262695312 3 152 832 1038
6 17657 2.12528020408448e-25 96.80313110351562 1 147 1105 1256

mgy_counts.parquet

Observation counts for each MGYP.

Column descriptions:

  • mgyp: The MGYP accession for a non-redundant protein sequence
  • mgyp_count: The number of observations for the protein
mgyp mgyp_count
1 281
2 1
3 10
5 5236
6 3

Example DuckDB queries

We use DuckDB for querying the Parquet files in the examples of this section, as it is fast, simple to use, and has APIs for popular languages like Python. However, there are other methods like Polars that can be used to interact with Parquet files. Please see the DuckDB website and documentation for more information about how to install and use DuckDB.

Warning

When querying the MGnify Proteins Parquet files remotely, the speed will be heavily impacted by the network and HTTPS, especially for results made up of thousands of rows. To truly take advantage of the Parquet files, we highly recommend downloading them either locally, or to a webserver that allows you to query them without being impeded by network issues.

Remote URL

The examples use a common URL to access the EMBL-EBI FTP that hosts the MGnify Proteins parquet files. This URL is: https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06.

Get MGYP sequences

One MGYP

If you have a particular MGYP accession, let’s say MGYP000000000001, and you want its amino acid sequence, you can do so with the following query:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_protein_sequences.parquet'
WHERE protein_id = 1;

Output:

protein_id full_length sequence
1 true MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA

Multiple MGYPs

SELECT * 
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_protein_sequences.parquet'
WHERE protein_id IN (1,2,3);

Output:

protein_id full_length sequence
1 true MQARVKWVEGLTFIGESASGHQILMDGNSGDKAPSPMEMVLMAAGGCSAIDVVSILQKGRHEVTNCEVKLTSERREEAPRLFTHINLHFIVTGKALKDAAVSRAVDLSAEKYCSVALMLEKAVKITHSYEVIEA
2 false MNSLKMIELSLTDVVIRMALAVVFGALIGLERSIKRKGFGISSNAILCLASCTISILQIQSVDILVDVVKQNSALASIISMDITRYGAQVISGVGFLGAGIIVFRERKVSGLTTAVMMWNVTIIGLVIGMGFLT
3 true MKKVVLTIALFSTFVISAKATTVYAEIADDTHGTSIAKVGLEKQDETDEPTDPIDPPDVPGTYIPTGNTGALRIDYISNIDFGTQKIASETKNYTAGNSDEFVETQISDLRGNGAGWNLQVSYDSEKAGFYTENGVALAGAELSLPAGTAKTVTENQSPAAETATVTVNKDAQNIMFAAATTGLGTWEDQMSAEAVSLKVPSGNLAGTYSATLVWTLTDAPT

Get an assembly’s proteins

If you have an ENA assembly accession, like ERZ23872034, you can find all of its MGYPs like this:

SELECT protein_id
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_proteins_metadata.parquet'
WHERE assembly_accession = 'ERZ23872034'
LIMIT 500; -- This will limit to 500 results, remove if you want all results

Output:

protein_id
1
17
19
24
33
53
61
64
80
86
100
105

Get all biomes and biome counts for a protein

If you want to get all the biomes MGYP000000000001 is in, along with how often it appears in each biome, you can do so with the following query:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_biomes.parquet'
WHERE protein_id = 1;

Output:

protein_id lineage_count lineage
1 2 root:Engineered:Bioreactor
1 5 root:Engineered:Bioremediation:Terephthalate:Wastewater
1 6 root:Engineered:Built environment
1 3 root:Engineered:Wastewater
1 6 root:Host-associated:Fish:Digestive system
1 53 root:Host-associated:Human:Digestive system
1 3 root:Host-associated:Human:Digestive system:Hindgut:Rectum
1 13 root:Host-associated:Human:Digestive system:Large intestine
1 155 root:Host-associated:Human:Digestive system:Large intestine:Fecal
1 7 root:Host-associated:Human:Digestive system:Oral
1 3 root:Host-associated:Human:Skin
1 15 root:Host-associated:Insecta:Digestive system
1 2 root:Host-associated:Mammals:Digestive system:Fecal
1 1 root:Host-associated:Mammals:Gastrointestinal tract:Intestine
1 1 root:Host-associated:Plants:Phylloplane
1 6 root:Mixed

Get cluster representative information

Get all members of a representative

If you have a MGYP that you know is a cluster representative, and you want to fetch all of its members, you can do this:

SELECT *
FROM 'https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2026_06/mgy_cluster_seqs.parquet'
WHERE cluster_rep = 21;

Output:

cluster_rep cluster_member
21 24867089
21 8799920023
21 9254161078
21 10470715852
21 526152225
21 342646082
21 21
21 7354554132
21 510538179
21 1205042829
21 461035109
21 1058480184
21 8315657225
21 84453492
21 8763227604
21 10629105130
21 9753012727
21 414100226
21 8096648057
21 1136948538
21 8164202599
21 10044486010
21 8998638882

Citation

BibTeX citation:
@online{2026,
  author = {, MGnify},
  title = {Remote Querying of {MGnify} {Proteins} Using {Parquet} and
    {DuckDB}},
  date = {2026-07-03},
  url = {https://docs.mgnify.org/src/docs/mgnify-proteins-parquet-queries.html},
  langid = {en}
}
For attribution, please cite this work as:
MGnify. 2026. “Remote Querying of MGnify Proteins Using Parquet and DuckDB.” July 3. https://docs.mgnify.org/src/docs/mgnify-proteins-parquet-queries.html.