
I am trying to read a single large Parquet file (size > GPU memory) with dask_cudf/dask, but it is currently being read into a single partition, which I am guessing is the expected behavior, inferring from the docstring:

    dask.dataframe.read_parquet(path, columns=None, filters=None, categories=None,
                                index=None, storage_options=None, engine='auto',
                                gather_statistics=None, **kwargs)

        Read a Parquet file into a Dask DataFrame

        This reads a directory of Parquet data into a Dask.dataframe, one file
        per partition. It selects the index among the sorted columns if any exist.

Is there a workaround to read it into multiple partitions?

– Vibhu Jawa

1 Answer


A Parquet dataset can be saved as separate files, and each file may contain multiple row groups. Dask DataFrame reads each Parquet row group into a separate partition.
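For example, here is a minimal sketch (the file name is hypothetical, and `split_row_groups` is only available in newer Dask versions with the pyarrow engine) of reading a file with multiple row groups as one partition per row group:

    import dask.dataframe as dd

    # Map each Parquet row group to its own partition.  On older Dask
    # versions this was the default behavior; on newer ones you may need
    # to request it explicitly with split_row_groups=True.
    df = dd.read_parquet("data.parquet", split_row_groups=True)
    print(df.npartitions)  # one partition per row group

The same keyword should carry over to `dask_cudf.read_parquet`, since it wraps the Dask reader, but I have not checked this against every release.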

Based on what you're saying, it sounds like your dataset has only a single row group. If that is the case, then unfortunately there is nothing that Dask can really do here.

You might want to go back to the source of the data to see how it was saved, and verify that whatever process saves this dataset does not create very large row groups.
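If you control the writing step, here is a sketch (the file names and the 100,000-row group size are arbitrary placeholders, and it assumes the data still fits in host memory even if it exceeds GPU memory) of rewriting the file with smaller row groups using pyarrow:

    import pyarrow.parquet as pq

    # Load the original file; this assumes it fits in host (CPU) memory
    # even though it is larger than GPU memory.
    table = pq.read_table("big.parquet")

    # row_group_size caps the number of rows per row group, so the
    # rewritten file can be split into one Dask partition per row group.
    pq.write_table(table, "big_rechunked.parquet", row_group_size=100_000)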

– MRocklin
  • Is that related to [partitioning parquet files](https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/io.html#partitioning-parquet-files) from the pandas documentation? It also uses the word partitioning, but it appears to be by columns, not rows. I'm trying to store a parquet dataset (using pandas) such that I can then read it with dask into multiple partitions. – gerrit Jan 29 '20 at 17:02
  • I have `pyarrow.parquet.ParquetFile("/tmp/test2.parquet").num_row_groups == 10`, but `dask.dataframe.read_parquet("/tmp/test2.parquet").npartitions == 1`. It looks like it's reading multiple row groups into a single partition? – gerrit Jan 30 '20 at 14:20
  • If that's the case then I recommend producing an [MCVE](https://stackoverflow.com/help/minimal-reproducible-example) and posting an issue at https://github.com/dask/dask/issues/new . – MRocklin Jan 30 '20 at 17:54