
I have a parquet file with 10 row groups:

In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10

But when I load it with Dask DataFrame, it is read into a single partition:

In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1

This appears to contradict this answer, which states that Dask DataFrame reads each Parquet row group into a separate partition.

How can I read each Parquet row group into a separate partition with Dask DataFrame? Or must the data be distributed over different files for this to work?
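For reference, a file with this row-group layout can be produced with pyarrow's row_group_size argument (hypothetical data, just to make the setup reproducible):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": range(1000)})
table = pa.Table.from_pandas(df)
# row_group_size caps rows per row group: 1000 rows / 100 = 10 row groups
pq.write_table(table, "/tmp/test2.parquet", row_group_size=100)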

gerrit

2 Answers


I believe that fastparquet will read each row-group separately, and the fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement that you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row-group each and a single file containing the same row-groups should result in the same partition structure.
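For completeness: newer Dask versions added a split_row_groups keyword to read_parquet that requests one partition per row group (the exact behavior and default have shifted between releases, so treat this as a sketch for a recent Dask):

import dask.dataframe as dd

# split_row_groups=True maps each Parquet row group to its own partition
ddf = dd.read_parquet("/tmp/test2.parquet", split_row_groups=True)
print(ddf.npartitions)  # expect 10 for a file with 10 row groups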

mdurant

You can read the file in batches with pyarrow:

import pyarrow.parquet as pq

batch_size = 1
_file = pq.ParquetFile("file.parquet")
# iter_batches returns a generator of pyarrow.RecordBatch objects;
# call batch.to_pandas() on one if you need a pandas DataFrame
batches = _file.iter_batches(batch_size)

for batch in batches:
    process(batch)  # process() is a placeholder for your own handling
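If you want one chunk per row group specifically, rather than fixed-size batches, ParquetFile also exposes read_row_group; a sketch reusing the process placeholder from above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)  # one pyarrow.Table per row group
    process(table)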

Angelo Mendes