I have a single Parquet file in S3 (not a partitioned dataset). I need to run a query like `select * from read_parquet('s3://....') where colA = 1 and colB = 2` from an ECS container where I've installed DuckDB.
- Will DuckDB read the entire Parquet file into memory and then apply the filters on colA and colB, or will it be able to selectively read records from the file by leveraging the Parquet metadata?
I know that predicate pushdown happens for Parquet files on a local filesystem, but I'm unsure about S3. To do it over S3, DuckDB would have to work out the byte range of the metadata/footer section of the Parquet file and then selectively fetch only the matching row groups via range requests, so I'm not sure it works this way (see the sketch below for how I'd try to check).
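One way I thought of checking this myself (a sketch, not something I've verified over S3): look at the query plan and see whether the filters show up inside the Parquet scan operator itself rather than in a separate FILTER node above it. The path is a placeholder:

```sql
-- If pushdown happens, the colA/colB predicates should appear
-- attached to the Parquet scan operator in the plan output
EXPLAIN ANALYZE
SELECT *
FROM read_parquet('s3://my-bucket/my-file.parquet')
WHERE colA = 1 AND colB = 2;
```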
- If I have a partitioned Parquet folder, will DuckDB automatically read only the data from the right partition based on the predicates in the query? I've seen a few answers from which I understand that it is possible to query partitioned Parquet with DuckDB, but again I'm not sure whether that applies when querying from S3 (an example of what I mean is below).
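To be concrete about the partitioned case, here's the kind of query I mean, assuming a hypothetical Hive-style layout like `s3://my-bucket/data/colA=1/colB=2/part-0.parquet` (the paths are made up; `hive_partitioning` is a real `read_parquet` option):

```sql
-- hive_partitioning derives colA and colB as columns from the
-- directory names; ideally DuckDB would then only list/read the
-- files under the matching partition directories
SELECT *
FROM read_parquet('s3://my-bucket/data/*/*/*.parquet', hive_partitioning = 1)
WHERE colA = 1 AND colB = 2;
```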
Any documentation or pointers to the relevant code would be really useful!
Thanks