
To read a parquet file into multiple partitions, it should be stored using row groups (see How to read a single large parquet file into multiple partitions using dask/dask-cudf?). The pandas documentation describes partitioning by columns, and the pyarrow documentation describes how to write multiple row groups. Using the pandas DataFrame .to_parquet method, can I access the ability to write multiple row groups, or will it always write a single row group? If it is possible, how?

Although the dataset is small (currently only 3 GB), I want to read it into multiple partitions so that subsequent processing with dask uses multiple cores. I could repartition after reading, but that adds overhead, and I might later work with datasets of some tens of GB, still small but too large for RAM.

gerrit

2 Answers


You can simply provide the keyword argument row_group_size when using pyarrow. Note that pyarrow is the default engine.

df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")
JulianWgs

Alternative answer for folks using fastparquet instead of pyarrow: fastparquet provides the same functionality via a differently named parameter, row_group_offsets.

df.to_parquet("filename.parquet", row_group_offsets=500, engine='fastparquet')

From the documentation on row_group_offsets (int or list of int):

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; If a list, the explicit index values to start new row groups; If None, set to 50_000_000. In case of partitioning the data, final row-groups size can be reduced significantly further by the partitioning, occurring as a subsequent step.
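As a minimal sketch of the list form described above (the column and values are made up for illustration), you can pass explicit start indices to control exactly where each row group begins:

import pandas as pd

# hypothetical example data
df = pd.DataFrame({"x": range(2000)})

# explicit row indices at which new row groups start: four groups of 500 rows
df.to_parquet("filename.parquet", row_group_offsets=[0, 500, 1000, 1500], engine="fastparquet")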

Haleemur Ali