I am learning to use Dask for parallel data processing for a university project. I have connected two nodes to process data with Dask.
My DataFrame contains customer IDs, dates, and transaction amounts. The CSV file is about 40 GB, and I read it with dask.dataframe. Here is a sample of the DataFrame:
CustomerID  Date      Transactions
A001        2022-1-1  1.52
A001        2022-1-2  2.05
B201        2022-1-2  8.76
A125        2022-1-2  6.28
D262        2022-1-3  7.35
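For context, my setup and read step look roughly like this (the scheduler address, file name, and dtypes below are placeholders, not my exact values):

from dask.distributed import Client
import dask.dataframe as dd

# Connect to the scheduler coordinating the two worker nodes
# (the address is a placeholder).
client = Client("tcp://scheduler-node:8786")

# Read the ~40 GB CSV; Dask splits it into many partitions.
ddf = dd.read_csv(
    "transactions.csv",
    dtype={"CustomerID": "object", "Transactions": "float64"},
    parse_dates=["Date"],
)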
Then I transformed each Dask partition into a pivot table. Here is a sample of one pivot table:
Date        2022-1-1  2022-1-2  2022-1-3
CustomerID
A001        1.52      2.05      0
A125        0         6.28      0
B201        0         8.76      0
D262        0         0         7.35
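Simplified, the pivot step looks like this (pivot_partition is just an illustrative name for my helper; the column arguments follow the sample above):

import pandas as pd

def pivot_partition(pdf):
    # Pivot one partition: rows are CustomerID, columns are Date,
    # values are the summed Transactions; missing cells become 0.
    return pd.pivot_table(
        pdf,
        index="CustomerID",
        columns="Date",
        values="Transactions",
        aggfunc="sum",
        fill_value=0,
    )

# Materialize each Dask partition as a pandas DataFrame and pivot it.
pivots_list = [pivot_partition(part.compute()) for part in ddf.to_delayed()]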
I then need to concatenate the pivot tables and sum the duplicate date columns, but the code below takes more than two hours to run. Is it possible to write the code another way so that Dask can process it in parallel? Thank you very much!
import pandas as pd

# Start with the first five pivot tables.
concat = pd.concat(pivots_list[:5], axis=1)
concat = concat.groupby(axis=1, level=0).sum()

# Fold the remaining pivot tables in, five at a time.
for idx in range(5, len(pivots_list), 5):
    print(idx, idx + 5)
    chunk = pivots_list[idx:idx + 5] + [concat]
    concat = pd.concat(chunk, axis=1)
    # Collapse duplicate date columns by summing them.
    concat = concat.groupby(axis=1, level=0).sum()

concat