I am reading in 64 compressed csv files (probably 70-80 GB) into one dask data frame then run groupby with aggregations.
The job never completed because appereantly the groupby creates a data frame with only one partition.
This post and this post already addressed this issue but focusing on the computational graph and not the memory issue you run into, when your resulting data frame is too large.
I tried a workaround with repartioning but the job still wont complete.
What am I doing wrong, will I have to use map_partition? This is very confusing as I expect Dask will take care of partitioning everything even after aggregation operations.
from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1, memory_limit='8GB',diagnostics_port=5000)
client
dask.config.set(scheduler='processes')
dB3 = dd.read_csv("boden/expansion*.csv", # read in parallel
blocksize=None, # 64 files
sep=',',
compression='gzip'
)
aggs = {
'boden': ['count','min']
}
dBSelect=dB3.groupby(['lng','lat']).agg(aggs).repartition(npartitions=64)
dBSelect=dBSelect.reset_index()
dBSelect.columns=['lng','lat','bodenCount','boden']
dBSelect=dBSelect.drop('bodenCount',axis=1)
with ProgressBar(dt=30): dBSelect.compute().to_parquet('boden/final/boden_final.parq',compression=None)