
I am learning to use Dask for parallel data processing for a university project, and I connected two nodes to process the data with Dask.

My data frame contains customer IDs, dates, and transaction amounts. The file is about 40 GB. I used dask.dataframe to read the CSV file. Here is a sample of the DataFrame:

CustomerID    Date        Transactions
A001          2022-1-1    1.52
A001          2022-1-2    2.05
B201          2022-1-2    8.76
A125          2022-1-2    6.28
D262          2022-1-3    7.35

Then I transformed each Dask partition into a pivot table. Here is a sample of one pivot table:

Date        2022-1-1  2022-1-2  2022-1-3
CustomerID
A001            1.52      2.05      0
A125            0         6.28      0
B201            0         8.76      0
D262            0         0         7.35


I need to concatenate the pivot tables, but the code below takes more than two hours to run. Is it possible to write it differently so that Dask can process it in parallel? Thank you very much!

import pandas as pd

# merge the first 5 pivot tables and collapse duplicate date columns
concat = pd.concat(pivots_list[:5], axis=1)
concat = concat.groupby(axis=1, level=0).sum()

# fold the remaining pivot tables in, 5 at a time,
# carrying the running result into every iteration
for idx in range(5, len(pivots_list), 5):
    print(idx, idx + 5)
    chunk = pivots_list[idx:idx + 5] + [concat]
    concat = pd.concat(chunk, axis=1)
    concat = concat.groupby(axis=1, level=0).sum()

concat

cs201503
  • Thanks for the question! Generally, operations which resize the arrays are really slow (and best done in one big operation rather than with lots of small ones), and so it's recommended that if you need to concatenate lots of dataframes together, you should just append them to a list and then concatenate them into a large array once at the end. If you structure it this way, there's also a chance you could parallelize the operation with dask. But as your code is currently written, each iteration of the loop depends on all previous iterations, so there's no way to parallelize this. – Michael Delgado May 27 '22 at 17:56
  • It would be much easier for us to help you, though, if you could explain more about what you're trying to do with the data, what the data looks like (e.g. what is pivot_list?) e.g. by pasting the result of `print(pivot_list)` as a code block, or better yet, generating a dummy example with code. See the guide to [creating reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Michael Delgado May 27 '22 at 17:58
  • Thank you for your reply @MichaelDelgado ! I added a sample of the dataframe and a sample of the pivot table. Based on your comment, I guess this is already the fastest way to concat? – cs201503 May 28 '22 at 08:31
  • No, not at all! You’re concatenating inside a loop. It is much faster to do this once at the end of the loop. See e.g. https://stackoverflow.com/a/71258620/3888719 – Michael Delgado May 28 '22 at 14:50
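
Following the comments, a minimal sketch of the append-then-concatenate-once pattern, assuming pivots_list is the same list of pandas pivot tables as in the question:

import pandas as pd

# concatenate all pivot tables in a single call instead of growing the
# result inside a loop, then collapse duplicate date columns once
concat = pd.concat(pivots_list, axis=1)
concat = concat.groupby(axis=1, level=0).sum()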

1 Answer


Thanks for clarifying this.

If you’re just trying to group by CustomerID and Date and sum, you should use groupby rather than pivot:

# with the dask dataframe (don’t pivot)
grouped = df.groupby(['CustomerID', 'Date']).sum()
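
For context, a minimal end-to-end sketch of that approach, assuming the CSV has the three columns from the sample above (the file name here is hypothetical):

import dask.dataframe as dd

# read the ~40 GB CSV in parallel as a Dask DataFrame
df = dd.read_csv('transactions.csv', parse_dates=['Date'])

# one lazy, parallel aggregation; no per-partition pivots or concat needed
grouped = df.groupby(['CustomerID', 'Date']).sum()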

If you then want the data with date columns, you could compute and then unstack the Date level. But you should definitely be careful - this is going to explode your memory unless you have a small number of total dates with a large number of transactions by each customer on each date. I’d carefully inspect the Dask object before computing.
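
A sketch of that last step, assuming the aggregated result is small enough to fit in memory on the client (unstack is one way to get the wide, one-column-per-date layout, and fillna(0) mirrors the zeros in the pivot sample):

# inspect the lazy result (partition count, dtypes) before computing
print(grouped)

# materialize the aggregation, then move Date from the index into the columns
result = grouped.compute()['Transactions'].unstack('Date').fillna(0)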

Michael Delgado