Suppose I have this DataFrame stored in a large HDF5 file:
        A  B    C
0  103896  1  2.0
1  103896  1  0.0
2  103896  1  5.0
3  103896  2  0.0
4  103896  2  7.0
5  103896  2    0
6  103896  2  0.0
7  103897  1  7.0
8  103897  1    0
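For reference, a minimal sketch that rebuilds this sample in memory (in the real case the frame comes out of the HDF5 store; the file name and key in the comment are just placeholders):
import pandas as pd

# df = pd.read_hdf('file.h5', 'data')  # how the real frame would be loaded (names are placeholders)
df = pd.DataFrame({
    'A': [103896] * 7 + [103897] * 2,
    'B': [1, 1, 1, 2, 2, 2, 2, 1, 1],
    'C': [2.0, 0.0, 5.0, 0.0, 7.0, 0, 0.0, 7.0, 0],
})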
Based on the first two columns, I would like to create a final list like this:
[[2.0, 0.0, 5.0], [0.0, 7.0, 0, 0.0], [7.0, 0]]
The only way I could think of to do that was:
df = df.groupby(['A', 'B'])['C'].apply(list)
and then iterating over df or converting it directly; since the groupby returns a Series of lists, the conversion would be:
final_list = df.tolist()
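Spelled out as a minimal sketch (variable names here are only illustrative), the two options look like this:
grouped = df.groupby(['A', 'B'])['C'].apply(list)   # Series of lists keyed by (A, B)

# either iterate group by group...
for (a, b), values in grouped.items():
    print(a, b, values)

# ...or materialise everything as one nested list
final_list = grouped.tolist()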
Is there another approach that doesn't build a new DataFrame/Series after the groupby, taking into account that my dataset is large?
Doing this with Dask is quite expensive in terms of memory, because it is not a real reduction:
df = df.groupby(['A', 'B'])['C'].apply(list, meta=('C', object)).compute()