
I have a Dask dataframe that I load like this:

import dask.dataframe as dd

dates_devices = dd.read_csv('data_part*.csv.gz', compression='gzip', blocksize=None)
dates_devices['cnt'] = 1
dates_devices = dates_devices.astype({'cnt': 'uint8'})  # make it smaller

I'm trying to use dask to pivot the table:

final_table = (dates_devices
               .categorize(columns=['date'])
               .pivot_table(index='device', 
                            columns='date', 
                            values='cnt').fillna(0).astype('uint8'))

That "runs" just fine, but when I do the implicit compute in the dd.to_parquet() I get:

MemoryError: Unable to allocate 5.42 GiB for an array with shape (727304656,) and data type uint8,

Then, taking a suggestion from here, I tried

`$ echo 1 > /proc/sys/vm/overcommit_memory`

but the kernel still gets killed. I have 32 GB of RAM and 32 GB of swap on Xubuntu Linux, so this should fit comfortably in memory. Is there a way to do this, or to "test" why exactly my kernel is getting killed?
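(One way to "test" this, offered only as a minimal sketch: run the computation under a local dask.distributed client with an explicit per-worker memory limit, so memory pressure shows up on the dashboard instead of only when the OOM killer fires. The worker count, memory limit, and output path below are illustrative guesses, not values from this question.)

from dask.distributed import Client

client = Client(n_workers=4, threads_per_worker=1, memory_limit='6GB')  # per-worker limit (illustrative)
print(client.dashboard_link)  # watch per-worker memory while the graph runs

final_table.to_parquet('pivoted_parquet')  # hypothetical output path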

Dervin Thunk

1 Answer


You can try writing your Dask dataframe in chunks to overcome the memory limitation. For example:

for i in range(final_table.npartitions):
    partition = final_table.get_partition(i)
    partition.compute().to_parquet(f'part_{i}.parquet')  # one file per partition; the path is just an example

Please see how I do .to_sql -- you can take a similar approach with .to_parquet: https://stackoverflow.com/a/62458085/6366770
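For instance, here is a sketch of that chunked approach adapted to .to_parquet, appending each partition into a single dataset the way the linked answer appends chunks into one SQL table. The output path is made up, and append behaviour depends on your parquet engine and Dask version, so treat this as a starting point rather than the linked answer's exact code:

out = 'final_table.parquet'                  # hypothetical output directory
for i in range(final_table.npartitions):
    part = final_table.get_partition(i)      # still lazy: a single partition
    part.to_parquet(out,
                    append=(i > 0),          # create the dataset on the first write, then append
                    ignore_divisions=True)   # divisions may be unknown after pivot_table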

David Erickson