5

Background:

  • DuckDB allows direct querying of parquet files, e.g. con.execute("Select * from 'Hierarchy.parquet'")

  • Parquet allows files to be partitioned by column values. When a parquet file is partitioned, a top-level FOLDER is created with the name of the parquet file, with subfolders for the column values, and these subfolders then contain the actual parquet data files, e.g. Hierarchy.parquet (folder) --> date=20220401 (subfolder) --> part1.parquet (see the sketch below for how such a layout is produced)
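For reference, a layout like that is what you get when the data is written with a partition column, e.g. via pandas/pyarrow (a minimal sketch with made-up column names, not from the question):

import pandas as pd

# Illustrative data; "date" will become the partition column
df = pd.DataFrame({
    "date": ["20220401", "20220401", "20220402"],
    "node": ["a", "b", "c"],
})

# partition_cols makes Hierarchy.parquet a FOLDER containing
# date=20220401/ and date=20220402/ subfolders with the part files
df.to_parquet("Hierarchy.parquet", partition_cols=["date"], engine="pyarrow")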

Expected behavior

DuckDB should be able to query partitioned AND unpartitioned parquet files.

Observed behaviour

DuckDB fails when querying partitioned parquet files but works with unpartitioned parquet files.

con.execute("Select * from 'Hierarchy.parquet'") fails with RuntimeError: IO Error: No files found that match the pattern "Hierarchy.parquet" when Hierarchy.parquet is partitioned.

Querying the underlying individual data files works fine: con.execute("Select * from 'Hierarchy.parquet/date=20220401/part1.parquet'")

Is there a way to query partitioned parquet files with DuckDB? Or is this a limitation/bug?

tomanizer

3 Answers

6

This is discussed as a DuckDB issue/feature request in

"Support hive-style partitioning of parquet archives" https://github.com/duckdb/duckdb/issues/2186.

The suggested workaround is:

  • Open the parquet file as an arrow/pyarrow dataset with a partitioning parameter
  • Register the dataset as a view in DuckDB
  • Query the view
import pyarrow.dataset as ds
import duckdb

con = duckdb.connect()

dataset = ds.dataset("Hierarchy.parquet", format="parquet", partitioning="hive")
con.register_arrow("Hierarchy", dataset)
con.execute("Select * from Hierarchy").df()

Direct support by DuckDB would be better... but this works fine and is easy.

tomanizer
    As of [version 0.3.4](https://github.com/duckdb/duckdb/commit/f05dfff249d3a33608176066273923d79a8824cf) this is now `register` instead of `register_arrow`. – Tom Phillips Apr 27 '22 at 15:54
    Thanks for this extremely helpful bit of code! I can't edit because of the error message "there are too many pending edits on Stack Overflow", but note that `Hierarchy.parquet` has mismatched quotation marks. I also wanted to add for other people new to pyarrow.dataset that the first argument to `ds.dataset` can simply be the path to the directory containing parquet files. – butterflyeffect Nov 17 '22 at 19:27
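Putting the comments together, a minimal sketch of the same workaround on DuckDB 0.3.4+ (using register instead of register_arrow, and passing the partitioned folder directly):

import pyarrow.dataset as ds
import duckdb

con = duckdb.connect()

# The path is the top-level partitioned folder; partitioning="hive"
# exposes the date=... subfolder values as a regular column
dataset = ds.dataset("Hierarchy.parquet", format="parquet", partitioning="hive")
con.register("Hierarchy", dataset)
con.execute("SELECT * FROM Hierarchy").df()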
2

It needs at least DuckDB 0.4.1 to support partitioned parquet files.

pip install -U --pre duckdb

For example, if you have the parquet files partitioned by Species,

['/Users/steven/data/iris1/Species=setosa/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet',
 '/Users/steven/data/iris1/Species=virginica/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet',
 '/Users/steven/data/iris1/Species=versicolor/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet']

you can query them in this way:

select * from parquet_scan(['/Users/steven/data/iris1/Species=setosa/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet',
 '/Users/steven/data/iris1/Species=virginica/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet',
 '/Users/steven/data/iris1/Species=versicolor/0-06ea69fc-393b-4dc7-8940-46700d407ec4-0.parquet']
                           , HIVE_PARTITIONING=true);

and you will get the combined result (screenshot omitted).

The parameter HIVE_PARTITIONING=true is very important: it makes DuckDB read the Species=... directory names as a column.
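Instead of listing every file explicitly, a glob pattern should also work with parquet_scan (a sketch assuming the same iris1 layout as above):

select * from parquet_scan('/Users/steven/data/iris1/*/*.parquet', HIVE_PARTITIONING=true);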

witwall
0

Our objective here is twofold:

  1. Take advantage of Parquet filters to load part of a dataset corresponding to a partition key.
  2. Use DuckDB to write queries on that filtered dataset.

I created a toy Parquet dataset of city data partitioned on state. The source CSV file looked like this (there are twenty-five rows in total):

rank,city,state,population,2000_2013_growth
1,New York,New York,8405837,4.8%
2,Los Angeles,California,3884307,4.8%
3,Chicago,Illinois,2718782,-6.1%

Import the packages I need:

import duckdb
import pandas as pd
import pyarrow.parquet as pq

My first objective is to get a subset of the partitioned data into something DuckDB can query, in this example a Pandas dataframe:

parquet_file = f'{parquet_folder}/city_data.parquet'
states = ['California', 'Michigan', 'New York']

# Only the partitions matching the filter on the partition column are read
ds = pq.ParquetDataset(parquet_file, filters=[('state', 'in', states)])
df = ds.read().to_pandas()

Next I'll build a query; a simple example would be:

query = """SELECT * FROM df WHERE State = 'California'"""
results_df = duckdb.query(query).to_df()

I just started working with Parquet and DuckDB, so I claim no expertise and this may not be the best approach. Happy to see others add their suggestions for improving it.
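One possible variation (my sketch, not part of the original answer): instead of converting to pandas, the filtered Arrow table can be registered with DuckDB directly, which keeps the partition filtering and skips the intermediate pandas dataframe:

import duckdb
import pyarrow.parquet as pq

states = ['California', 'Michigan', 'New York']

# parquet_file as defined above; only the selected state partitions are read
table = pq.ParquetDataset(parquet_file, filters=[('state', 'in', states)]).read()

con = duckdb.connect()
con.register('cities', table)
results_df = con.execute("SELECT * FROM cities WHERE state = 'California'").df()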

GDB
  • Isn't the downside of this approach that pandas will try to load all the data into memory? It's fine for 25-row files, but not when you have hundreds of gigabytes of parquet... – peetonn Nov 19 '22 at 12:25