Suppose I have this DataFrame stored in a large HDF5 file:

        A  B    C
0  103896  1  2.0
1  103896  1  0.0
2  103896  1  5.0
3  103896  2  0.0
4  103896  2  7.0
5  103896  2  0.0
6  103896  2  0.0
7  103897  1  7.0
8  103897  1  0.0

Based on the first two columns, I would like to create a final list like this:

[[2.0, 0.0, 5.0], [0.0, 7.0, 0.0, 0.0], [7.0, 0.0]]

The only way I could think of to do that was:

 df = df.groupby(['A', 'B'])['C'].apply(list)

Then iterating over df, or simply converting the resulting Series:

final_list = df.tolist()

Is there another approach that doesn't return a new DataFrame after the groupby, taking into account that my dataset is large?

Doing this with Dask is quite expensive in terms of memory, because it is not a real reduction:

df = df.groupby(['A', 'B'])['C'].apply(list, meta=(float)).compute()
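
One workaround would be streaming the file with plain pandas and building the lists incrementally. A minimal sketch, assuming the HDF5 store was written in table format under a (hypothetical) key 'data':

import pandas as pd
from collections import defaultdict

groups = defaultdict(list)
# Stream the table chunk by chunk instead of loading it whole;
# this requires the store to be in table format.
for chunk in pd.read_hdf('data.h5', key='data', chunksize=100_000):
    # Groups that span chunk boundaries are merged by the dict.
    for key, values in chunk.groupby(['A', 'B'])['C']:
        groups[key].extend(values.tolist())

final_list = list(groups.values())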

1 Answer

We can use:

[list(c) for i, c in df.groupby(['A', 'B'])['C']]
#[[2.0, 0.0, 5.0], [0.0, 7.0, 0.0, 0.0], [7.0, 0.0]]
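
Iterating the groupby object directly yields (key, group) pairs, so each group's values are turned into a plain list without first building an intermediate Series of lists.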

or

df.groupby(['A', 'B'])['C'].apply(list).tolist()
#[[2.0, 0.0, 5.0], [0.0, 7.0, 0.0, 0.0], [7.0, 0.0]]

Timing comparison:

%%timeit
[list(c) for i,c in  df.groupby(['A','B'])['C']]
1.82 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.groupby(['A', 'B'])['C'].apply(list).tolist()
3.38 ms ± 473 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
  • Do you know how to do this [list(c) for i,c in df.groupby(['A','B'])['C']] using dask, in order to avoid loading the entire dataset into memory? – Cesar Nov 28 '19 at 14:01
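
For the comment above, one possibility, offered as an untested sketch: Dask supports custom groupby reductions via its dd.Aggregation API, which keeps the per-partition work lazy until compute time, although the final collected lists must still fit in memory once assembled.

import itertools
import dask.dataframe as dd

# Custom reduction that collects each group's 'C' values into a list.
collect_list = dd.Aggregation(
    name='collect_list',
    # Per-partition step: a list of values for each group in the partition.
    chunk=lambda s: s.apply(list),
    # Combine step: concatenate the per-partition lists of each group.
    agg=lambda s: s.apply(lambda lists: list(itertools.chain.from_iterable(lists))),
)

result = df.groupby(['A', 'B'])['C'].agg(collect_list).compute()
final_list = result.tolist()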