I have a Dask dataframe that I load like this:
import dask.dataframe as dd

dates_devices = dd.read_csv('data_part*.csv.gz', compression='gzip', blocksize=None)
dates_devices['cnt'] = 1
# astype is not in-place, so assign the result back to actually shrink the column
dates_devices = dates_devices.astype({'cnt': 'uint8'})
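(Side note: I think the column can also be created at its target dtype in one step, assuming pandas' scalar-assignment semantics carry over to each partition:)

import numpy as np

dates_devices['cnt'] = np.uint8(1)  # column should come out as uint8 directly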
I'm trying to use Dask to pivot the table:
final_table = (dates_devices
.categorize(columns=['date'])
.pivot_table(index='device',
columns='date',
values='cnt').fillna(0).astype('uint8'))
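Before computing, the dense pivot size can be estimated up front, since pivot_table materializes a full devices × dates grid (a rough sketch using my column names; I believe the cells come out as float64, hence the fillna, before the final astype):

n_devices = dates_devices['device'].nunique().compute()
n_dates = dates_devices['date'].nunique().compute()
cells = n_devices * n_dates
print(f'dense pivot, uint8:   {cells / 2**30:.2f} GiB')
print(f'intermediate float64: {cells * 8 / 2**30:.2f} GiB')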
That "runs" just fine, but when I do the implicit compute in the dd.to_parquet()
I get:
MemoryError: Unable to allocate 5.42 GiB for an array with shape (727304656,) and data type uint8
Then, following a suggestion taken from here, I tried
`$ echo 1 > /proc/sys/vm/overcommit_memory`
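(For reference, that write needs root; these are the equivalent forms I know of:)

$ sudo sysctl vm.overcommit_memory=1
$ sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'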
but the kernel still gets killed. I have 32 GB of RAM and 32 GB of swap on Xubuntu Linux, so a 5.42 GiB allocation should fit comfortably in RAM. Is there a way to make this work, or to "test" why exactly my kernel is getting killed?
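To help debug, I plan to check `dmesg | grep -i oom` to confirm whether the Linux OOM killer is involved, and to watch memory during the compute with something like this (a sketch; the output path is made up):

from dask.diagnostics import ResourceProfiler

with ResourceProfiler(dt=1) as rprof:  # sample memory/CPU once per second
    dd.to_parquet(final_table, 'final_table.parquet')
rprof.visualize()  # plots memory use over time (needs bokeh)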