
I have a parquet file with 10 row groups:

In [30]: print(pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups)
10

But when I load it with Dask DataFrame, it is read into a single partition:

In [31]: print(dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions)
1

This appears to contradict this answer, which states that Dask DataFrame reads each Parquet row group into a separate partition.

How can I read each Parquet row group into a separate partition with Dask DataFrame? Or must the data be distributed over different files for this to work?
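For reference, a file with this row-group layout can be produced with pyarrow's row_group_size argument (hypothetical data, just to make the setup reproducible):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": range(1000)})
table = pa.Table.from_pandas(df)
# row_group_size caps rows per row group: 1000 rows / 100 = 10 row groups
pq.write_table(table, "/tmp/test2.parquet", row_group_size=100)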

gerrit

2 Answers


I believe that fastparquet will read each row-group separately, and the fact that pyarrow apparently doesn't could be considered a bug, or at least a feature enhancement that you could request on the Dask issue tracker. I would tend to agree that a set of files containing one row-group each and a single file containing the same row-groups should result in the same partition structure.
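For completeness: newer Dask versions added a split_row_groups keyword to read_parquet that requests one partition per row group (the exact behavior and default have shifted between releases, so treat this as a sketch for a recent Dask):

import dask.dataframe as dd

# split_row_groups=True maps each Parquet row group to its own partition
ddf = dd.read_parquet("/tmp/test2.parquet", split_row_groups=True)
print(ddf.npartitions)  # expect 10 for a file with 10 row groups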

mdurant

You can read the file in batches with pyarrow:

import pyarrow.parquet as pq

batch_size = 1
_file = pq.ParquetFile("file.parquet")
# iter_batches returns a generator of pyarrow.RecordBatch objects;
# call batch.to_pandas() on one if you need a pandas DataFrame
batches = _file.iter_batches(batch_size)

for batch in batches:
    process(batch)  # process() is a placeholder for your own handling
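If you want one chunk per row group specifically, rather than fixed-size batches, ParquetFile also exposes read_row_group; a sketch reusing the process placeholder from above:

import pyarrow.parquet as pq

pf = pq.ParquetFile("file.parquet")
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)  # one pyarrow.Table per row group
    process(table)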

Angelo Mendes