
To read a parquet file into multiple partitions, it should be stored using row groups (see How to read a single large parquet file into multiple partitions using dask/dask-cudf?). The pandas documentation describes partitioning by columns, and the pyarrow documentation describes how to write multiple row groups. Using the pandas DataFrame .to_parquet method, can I access the ability to write multiple row groups, or will it always write a single row group? If it is possible, how?

Although the dataset is small (currently only 3 GB), I want to read it into multiple partitions so that subsequent processing with dask uses multiple cores. I could repartition after reading, but that adds overhead, and I might later work with datasets of some tens of GB, still small but too large for RAM.

gerrit

2 Answers


You can simply provide the keyword argument row_group_size when using pyarrow. Note that pyarrow is the default engine.

df.to_parquet("filename.parquet", row_group_size=500, engine="pyarrow")
JulianWgs

Alternative answer for folks using fastparquet instead of pyarrow: fastparquet provides the same functionality via a differently named parameter, row_group_offsets.

df.to_parquet("filename.parquet", row_group_offsets=500, engine='fastparquet')

From the documentation on row_group_offsets (int or list of int):

If int, row-groups will be approximately this many rows, rounded down to make row groups about the same size; If a list, the explicit index values to start new row groups; If None, set to 50_000_000. In case of partitioning the data, final row-groups size can be reduced significantly further by the partitioning, occurring as a subsequent step.
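As a minimal sketch of the list form described above (the column and values are made up for illustration), you can pass explicit start indices to control exactly where each row group begins:

import pandas as pd

# hypothetical example data
df = pd.DataFrame({"x": range(2000)})

# explicit row indices at which new row groups start: four groups of 500 rows
df.to_parquet("filename.parquet", row_group_offsets=[0, 500, 1000, 1500], engine="fastparquet")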

Haleemur Ali