dask dataframe groupby resulting in one partition memory issue

Question

I am reading 64 compressed CSV files (probably 70-80 GB) into one Dask data frame and then running a groupby with aggregations.

The job never completes, apparently because the groupby creates a data frame with only one partition.

This post and this post already addressed this issue, but they focus on the computational graph rather than the memory problem you run into when the resulting data frame is too large.

I tried a workaround with repartitioning, but the job still won't complete.

What am I doing wrong? Will I have to use map_partitions? This is very confusing, as I expected Dask to take care of partitioning everything, even after aggregation operations.

    import dask
    import dask.dataframe as dd
    from dask.diagnostics import ProgressBar
    from dask.distributed import Client, progress
    client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB', diagnostics_port=5000)
    client

    dask.config.set(scheduler='processes')
    dB3 = dd.read_csv("boden/expansion*.csv",  # read in parallel
                 blocksize=None, # 64 files
                 sep=',',
                 compression='gzip'
    )

    aggs = {
      'boden': ['count','min']
    }
    dBSelect=dB3.groupby(['lng','lat']).agg(aggs).repartition(npartitions=64) 
    dBSelect=dBSelect.reset_index()
    dBSelect.columns=['lng','lat','bodenCount','boden']
    dBSelect=dBSelect.drop('bodenCount',axis=1)
    with ProgressBar(dt=30):
        dBSelect.compute().to_parquet('boden/final/boden_final.parq', compression=None)

score 3 · Accepted Answer · answered Apr 27 '19 at 14:32

Most groupby aggregation outputs are small and fit easily in one partition. Clearly this is not the case in your situation.

To resolve this you should use the split_out= parameter to your groupby aggregation to request a certain number of output partitions.

    df.groupby(['x', 'y', 'z']).mean(split_out=10)

Note that using split_out= will significantly increase the size of the task graph (it has to mildly shuffle/sort your data ahead of time) and so may increase scheduling overhead.

MRocklin

Thanks @MRocklin. My workaround was to use map_partitions and apply a function to aggregate; however, I am not sure whether this really does it correctly – user670186 Apr 27 '19 at 19:58
