How can I accomplish a groupby-size task like the one below on a resource-limited machine?
My code looks like this:
```python
import dask.dataframe as dd

ddf = dd.read_parquet(parquet_path)
sr = ddf.groupby(["col_1", "col_2"]).size()
sr.to_csv(csv_path)
```
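For clarity, here is a tiny pandas-only example (with made-up values) of the kind of result I'm after: one row count per unique `(col_1, col_2)` pair.

```python
import numpy as np
import pandas as pd

# Made-up sample data, just to illustrate the expected output shape.
df = pd.DataFrame({
    "col_1": np.array([1, 1, 2, 2, 2], dtype="uint64"),
    "col_2": np.array([7, 7, 7, 8, 8], dtype="uint64"),
})
print(df.groupby(["col_1", "col_2"]).size())
# col_1  col_2
# 1      7        2
# 2      7        1
#        8        2
# dtype: int64
```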
My data:
- The Parquet data is around 7 GB with roughly 300M records in total, and it is expected to grow to about 3 times that size after more data is appended.
- The Parquet file consists of 30 parts, each around 235 MB. The parts were written batch by batch with `.to_parquet(append=True)` on the same machine, so I didn't run into memory issues when generating the data (a rough sketch of that pattern follows this list).
- Both `col_1` and `col_2` have data type `uint64`.
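Roughly, the batch writes followed a pattern like the sketch below. The path and the random data are placeholders, not my real pipeline; the point is that each batch is small enough to build in memory and is appended as one Parquet part.

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd

parquet_path = "data.parquet"  # placeholder path

for i in range(30):
    # Placeholder batch: in the real pipeline each batch comes from the source data.
    batch = pd.DataFrame({
        "col_1": np.random.randint(0, 1_000, size=1_000_000, dtype="uint64"),
        "col_2": np.random.randint(0, 1_000, size=1_000_000, dtype="uint64"),
    })
    dd.from_pandas(batch, npartitions=1).to_parquet(
        parquet_path,
        append=(i > 0),     # first batch creates the dataset, later batches append
        write_index=False,  # no index needed, avoids division checks when appending
    )
```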
The code worked correctly on a small sample but failed on a large one. I don't know what options I have for accomplishing this task on an ordinary Win10 laptop with only 12GB of memory installed.