I have a pandas DataFrame of size (607875, 12294). The data is sparse and looks like:
     ID  BB  CC  DD ...
0   abc   0   0   1 ...
1   bcd   0   0   0 ...
2   abc   0   0   1 ...
...
I converted it to a sparse format by calling
dataframe = dataframe.to_sparse()
Later, I grouped it by ID and summed the row values with
dataframe = dataframe.groupby("ID").sum()
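For reference, here is a minimal reproducible sketch of this pipeline on a toy frame. The toy data and column values below are made up just to mirror the structure, and to_sparse only exists in pandas versions before 1.0:

import pandas as pd

# Toy frame mirroring the structure: a string ID column plus 0/1 value columns.
df = pd.DataFrame({
    "ID": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})

# The same two steps I run on the full (607875, 12294) frame.
df = df.to_sparse()             # deprecated; removed in pandas 1.0
df = df.groupby("ID").sum()
print(df)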
For smaller DataFrames this works perfectly well, but at this size it ran for an hour without finishing.
Is there a way to speed this up or work around it? Are there other sparse methods I can use, since the to_sparse method is deprecated?
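In case the newer sparse dtype API is the intended replacement, this is how I understand the conversion would look. It is only a sketch; casting everything except ID to SparseDtype("int64", 0) is my assumption about the right way to handle the string column:

import pandas as pd

# Same toy frame as above.
df = pd.DataFrame({
    "ID": ["abc", "bcd", "abc"],
    "BB": [0, 0, 0],
    "CC": [0, 0, 0],
    "DD": [1, 0, 1],
})

# Cast only the numeric 0/1 columns to a sparse extension dtype with fill value 0;
# astype(pd.SparseDtype(...)) is what recent pandas offers instead of to_sparse().
value_cols = df.columns.drop("ID")
df = pd.concat(
    [df[["ID"]], df[value_cols].astype(pd.SparseDtype("int64", 0))],
    axis=1,
)

# Same aggregation as before; I don't know whether this takes a faster path.
result = df.groupby("ID").sum()
print(result)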
The output DataFrame would have size (2000, 12294) and look like this (assuming the rows with ID abc contain no other 1s):
     ID  BB  CC  DD ...
0   abc   0   0   2 ...
1   bcd   0   0   0 ...
...
I have 32 GB of RAM on my PC, so memory should be enough.