Here's a sample dataframe:

label  data
a      1.09
b      2.1
a      5.0
b      2.0
c      1.9

What I want is

arr = [[1.09, 5.0], [2.1, 2.0],[1.9]]

preferably as a list of numpy arrays.

I know that df.groupby('label').groups.keys() gives me the list ['a','b','c'], and df.groupby('label').groups.values() gives me something like arr, but as Int64Index objects. However, I tried df.loc[df.groupby('label').groups.values()]['label'] and it doesn't give the desired result.

How do I accomplish this? Thanks!

irene

1 Answer

"preferably as a list of numpy arrays."

Preferably not, because you're asking for ragged arrays, meaning the inner arrays (i.e., the rows) are not all the same length. NumPy cannot store such data as a single contiguous C array, so it falls back to an array of slow Python objects.
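To make the ragged-array point concrete, here is a minimal sketch comparing a rectangular array with a ragged one. The variable names are illustrative only, and the explicit dtype=object is there because recent NumPy versions refuse to build a ragged array without it.

```python
import numpy as np

# Rectangular data: every row has the same length, so NumPy can
# store it as one contiguous block of float64 values.
rect = np.array([[1.09, 5.0], [2.1, 2.0]])
print(rect.dtype, rect.shape)  # float64 (2, 2)

# Ragged data: rows differ in length, so the best NumPy can do is a
# 1-D array whose elements are ordinary Python list objects.
ragged = np.array([[1.09, 5.0], [2.1, 2.0], [1.9]], dtype=object)
print(ragged.dtype, ragged.shape)  # object (3,)
```

The second array gains nothing over a plain list of lists: each element is still a Python object, so vectorised arithmetic over the whole structure is not available.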

In this situation, I'd recommend nested Python lists, which you can get with a groupby + apply.

lst = df.groupby('label')['data'].apply(pd.Series.tolist).tolist()
print(lst)
# [[1.09, 5.0], [2.1, 2.0], [1.9]]
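For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample frame from the question. The last step, which produces a list of NumPy arrays by iterating over the groups, is an addition for illustration and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "label": ["a", "b", "a", "b", "c"],
    "data": [1.09, 2.1, 5.0, 2.0, 1.9],
})

# Nested Python lists, one inner list per label.
lst = df.groupby("label")["data"].apply(pd.Series.tolist).tolist()
print(lst)  # [[1.09, 5.0], [2.1, 2.0], [1.9]]

# If arrays are really needed, keep the outer container a plain list
# and convert each group separately (each inner array is then a
# regular float64 array).
arrays = [group.to_numpy() for _, group in df.groupby("label")["data"]]
```

Keeping the outer container a Python list avoids the object-dtype fallback while still giving a proper array per group.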
Joe
cs95
  • I'm getting this error though: `AttributeError: 'DataFrameGroupBy' object has no attribute 'data'` – irene Jun 21 '18 at 06:09
  • @irene ummm it's supposed to be the name of your column? Try again with a slightly different syntax please (edit made ^). – cs95 Jun 21 '18 at 06:10
  • Isn't this a similar one? https://stackoverflow.com/questions/22219004/grouping-rows-in-list-in-pandas-groupby – Bharath M Shetty Jun 21 '18 at 06:15
  • @Dark yup. I'll keep this one here because of the little primer on arrays :) – cs95 Jun 21 '18 at 06:17
  • Oh haha I see. Thanks! Also, is it guaranteed that this follows the order in `df.groupby('label').groups.keys()`? @coldspeed – irene Jun 21 '18 at 06:28
  • @coldspeed Also, it's not a duplicate of https://stackoverflow.com/questions/22219004/grouping-rows-in-list-in-pandas-groupby. I want to get a list of lists; the other is looking for a dataframe. Thanks for the answer though. – irene Jun 21 '18 at 06:31
  • @irene Hmm, that's a good question. I think the order is guaranteed when you do `df.groupby('label', sort=False).groups.keys()`. – cs95 Jun 21 '18 at 06:32
  • @coldspeed Alternatively, can I try `df.groupby('label', sort=True)['data'].apply(pd.Series.tolist).tolist()`? Would that be a safer option? – irene Jun 21 '18 at 06:36
  • @irene Safer? Depends on what you want? But with `sort=True` you are guaranteed the same order for the same keys. – cs95 Jun 21 '18 at 06:39
  • @coldspeed I just want to make sure that I know which key corresponds to which array in the list. Thanks! – irene Jun 21 '18 at 06:40
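One way to sidestep the ordering question raised in the comments is to keep the keys and the lists together instead of relying on two separate calls lining up. The sketch below is one possible approach, not something from the thread. It assumes the sample frame from the question, and the names grouped, keys, values and by_label are illustrative.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "label": ["a", "b", "a", "b", "c"],
    "data": [1.09, 2.1, 5.0, 2.0, 1.9],
})

# The result of groupby + apply is a Series indexed by the group keys,
# so keys and lists come from the same object and cannot get out of step.
grouped = df.groupby("label", sort=True)["data"].apply(list)

keys = grouped.index.tolist()   # group labels, in sorted order
values = grouped.tolist()       # inner lists, in the same order as keys
by_label = grouped.to_dict()    # explicit label -> list mapping

print(keys)            # ['a', 'b', 'c']
print(by_label["b"])   # [2.1, 2.0]
```

With the dictionary, the question of which key belongs to which list no longer depends on ordering at all.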