
I am trying to load a Parquet file with row group size = 10 into a DuckDB table in chunks. I can't find any documentation that covers this.

This is my work so far: see code

import duckdb
import pandas as pd
import gc
import numpy as np

# connect to a file-backed (persistent) database
con = duckdb.connect(database='database.duckdb', read_only=False)

df1 = pd.read_parquet("file1.parquet")
df2 = pd.read_parquet("file2.parquet")

# create the table "table1" from the DataFrame "df1"
con.execute("CREATE TABLE table1 AS SELECT * FROM df1")

# create the table "table2" from the DataFrame "df2"
con.execute("CREATE TABLE table2 AS SELECT * FROM df2")

con.close()
gc.collect()

Please help me load both tables from the Parquet files by row group (or in chunks), and insert the data into DuckDB in chunks as well.

Sushmitha
  • Does this answer your question? [Is it possible to read parquet files in chunks?](https://stackoverflow.com/questions/59098785/is-it-possible-to-read-parquet-files-in-chunks) – Vikram Jan 17 '23 at 20:03

2 Answers

df1 = pd.read_parquet("file1.parquet")

This statement will read the entire Parquet file into memory. Instead, I assume you want to read it in chunks (i.e. one row group after another, or in batches) and then write each chunk into DuckDB.

This is not possible with pandas as of now. You can use something like pyarrow (or fastparquet) to do this. Here is an example from the pyarrow docs.

iter_batches can be used to read streaming batches from a Parquet file. This can be used to read in batches, read certain row groups or even certain columns.

import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
# stream the file 10 rows at a time
for i in parquet_file.iter_batches(batch_size=10):
    print("RecordBatch")
    print(i.to_pandas())

The above example simply reads 10 records at a time. You can further limit this to certain row groups or even certain columns, like below.

for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(i.to_pandas())
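
To get those batches into DuckDB, one option is to convert each batch to a DataFrame and insert it. Below is a minimal sketch reusing the file and table names from your question; it assumes DuckDB's replacement scan, which lets you query a local pandas DataFrame by its variable name (on older versions you can use con.register instead).

import duckdb
import pyarrow.parquet as pq

con = duckdb.connect(database='database.duckdb', read_only=False)
parquet_file = pq.ParquetFile('file1.parquet')

for n, batch in enumerate(parquet_file.iter_batches(batch_size=10)):
    batch_df = batch.to_pandas()
    if n == 0:
        # first batch: create the table with the batch's schema and rows
        con.execute("CREATE TABLE table1 AS SELECT * FROM batch_df")
    else:
        # later batches: append
        con.execute("INSERT INTO table1 SELECT * FROM batch_df")

con.close()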

Hope this helps!

ns15

This is not necessarily a solution (I like the pyarrow-oriented one already submitted!), but here are some other pieces of information that may help you. I am attempting to guess what your root-cause problem is! (https://xyproblem.info/)

In the next release of DuckDB (and on the current master branch), data will be written to disk in a streaming fashion for inserts. This should allow you to insert ~any size of Parquet file into a file-backed persistent DuckDB without running out of memory. Hopefully it removes the need for you to do batching at all (since DuckDB will batch based on your rowgroups automatically)! For example:

con.execute("CREATE TABLE table1 AS SELECT * FROM 'file1.parquet'")
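
Applied to your two files, that would look something like this (a sketch using the file and table names from your question):

import duckdb

con = duckdb.connect(database='database.duckdb', read_only=False)

# DuckDB reads the Parquet files directly; no pandas round-trip is needed
con.execute("CREATE TABLE table1 AS SELECT * FROM 'file1.parquet'")
con.execute("CREATE TABLE table2 AS SELECT * FROM 'file2.parquet'")

con.close()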

Another note is that the typically recommended size of a rowgroup is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small rowgroups. Compression will work better, since compression operates within a rowgroup only. There will also be less overhead spent on storing statistics, since each rowgroup stores its own statistics. And, since DuckDB is quite fast, it will process a 100,000 or 1,000,000 row rowgroup quite quickly (whereas the overhead of reading statistics may slow things down with really small rowgroups).
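
If you control how the Parquet files are produced, you could also rewrite them with larger row groups. Here is a minimal sketch with pyarrow (the output file name and the 100,000-row figure are just illustrative):

import pyarrow.parquet as pq

# note: read_table loads the whole file into memory, which is fine for a one-off rewrite
table = pq.read_table('file1.parquet')
pq.write_table(table, 'file1_rewritten.parquet', row_group_size=100_000)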