I'm trying to one-hot encode a dataset and then group by a specific column, so that I end up with one row per unique value in that column and an aggregated view of which one-hot columns are true for that row. This works on small data, and using Dask it seems to work for large datasets too, but I'm having problems when I try to save the result to a file. I've tried both CSV and Parquet. I want to save the result so I can open it later in chunks.
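To make the goal concrete, here is roughly what I mean on a tiny pandas frame (the values are made up, just for illustration):

import pandas as pd

# toy example of the intended result: one row per value of A, with a one-hot
# column considered "on" if any row in that group had that B value
small = pd.DataFrame({'A': [1, 1, 2], 'B': ['x', 'y', 'x']})
print(pd.get_dummies(small, columns=['B']).groupby('A').max())
# group 1 has both B_x and B_y set, group 2 only B_x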
Here's code to show the issue (the script below generates 2M rows and up to 30k unique values to one-hot encode):
import pandas as pd
import numpy as np
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait
sizeOfRows = 2000000
columnsForDF = 30000
partitionsforDask = 500
print("partition is ", partitionsforDask)
cluster = LocalCluster()
client = Client(cluster)
print(client)
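# column A is the group key, column B holds the values to one-hot encode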
df = pd.DataFrame(np.random.randint(0,columnsForDF,size=(sizeOfRows, 2)), columns=list('AB'))
ddf = dd.from_pandas(df, npartitions=partitionsforDask)
# ddf = ddf.persist()
wait(ddf)
# %%time
# need to globally know the categories before one hot encoding
ddf = ddf.categorize(columns=["B"])
one_hot = dd.get_dummies(ddf, columns=['B'])
print("starting groupby")
# result = one_hot.groupby('A').max().persist() # or to_parquet/to_csv/compute/etc.
# result = one_hot.groupby('A', sort=False).max().to_csv('./daskDF.csv', single_file = True)
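# aggregate the one-hot columns per value of A and write the result to disk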
result = one_hot.groupby('A', sort=False).max().to_parquet('./parquetFile')
wait(result)
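For context, this is roughly how I intend to read the saved output back later, one piece at a time (a sketch, assuming the Parquet write above had succeeded):

import dask.dataframe as dd

# read the saved directory back lazily, then process it one partition at a time
saved = dd.read_parquet('./parquetFile')
for part in saved.to_delayed():
    chunk = part.compute()  # each partition comes back as a pandas DataFrame
    # ... process the chunk here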
The script seems to work until it reaches the groupby and the write to CSV or Parquet. At that point I get many warnings about workers exceeding 95% of their memory budget, and then the program exits with a KilledWorker exception:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
KilledWorker: ("('dataframe-groupby-max-combine-3ddcd8fc854613101b4bdc7fccde32cd', 1, 0, 0)", <Worker 'tcp://127.0.0.1:33815', name: 6, memory: 0, processing: 22>)
Monitoring my machine, I never come close to exhausting memory, and my drive has over 300 GB free that is never used (no file is created during this process, even though the failure happens in the groupby/write step).
What can I do?
Update - I thought I'd add a bounty. I'm having the same problem with .to_csv as well; since someone else has had a similar problem, I hope an answer has value for a wide audience.