I have a single Parquet file in S3 (not a partitioned dataset). I need to run a query like `select * from read_parquet('s3://....') where colA = 1 and colB = 2` from an ECS container where I've installed DuckDB.
- Will DuckDB read the entire Parquet file into memory and then apply the filters on colA and colB, or will it be able to selectively read records from the file by leveraging the Parquet metadata?
I know that predicate pushdown happens for Parquet files on a local filesystem, but I'm unsure about S3. To do it over S3, DuckDB would have to work out the byte range of the metadata/footer section of the Parquet file and then selectively fetch only the matching row groups via range requests, so I'm not sure it works this way (see the sketch below for how I'd try to check).
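One way I thought of checking this myself (a sketch, not something I've verified over S3): look at the query plan and see whether the filters show up inside the Parquet scan operator itself rather than in a separate FILTER node above it. The path is a placeholder:

```sql
-- If pushdown happens, the colA/colB predicates should appear
-- attached to the Parquet scan operator in the plan output
EXPLAIN ANALYZE
SELECT *
FROM read_parquet('s3://my-bucket/my-file.parquet')
WHERE colA = 1 AND colB = 2;
```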
- If I have a partitioned Parquet folder, will DuckDB automatically read only the data from the right partition based on the predicates in the query? I've seen a few answers from which I understand that it is possible to query partitioned Parquet with DuckDB, but again I'm not sure whether that applies when querying from S3 (an example of what I mean is below).
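To be concrete about the partitioned case, here's the kind of query I mean, assuming a hypothetical Hive-style layout like `s3://my-bucket/data/colA=1/colB=2/part-0.parquet` (the paths are made up; `hive_partitioning` is a real `read_parquet` option):

```sql
-- hive_partitioning derives colA and colB as columns from the
-- directory names; ideally DuckDB would then only list/read the
-- files under the matching partition directories
SELECT *
FROM read_parquet('s3://my-bucket/data/*/*/*.parquet', hive_partitioning = 1)
WHERE colA = 1 AND colB = 2;
```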
Any documentation or pointers to the relevant code would be really useful!
Thanks