
Suppose I have a Python dict of lists like so:

di = {'Grp':  ['2',    '6',    '6',    '5',    '5',    '6',    '6',    '7',    '7',    '6'],
      'Nums': ['6.20', '6.30', '6.80', '6.45', '6.55', '6.35', '6.37', '6.36', '6.78', '6.33']}

I can easily group the numbers with their group key using itertools.groupby:

from itertools import groupby

for k, l in groupby(zip(di['Grp'], di['Nums']), key=lambda t: t[0]):
    print(k, [t[1] for t in l])

Prints:

2 ['6.20']
6 ['6.30', '6.80']      # one field, key=6
5 ['6.45', '6.55']
6 ['6.35', '6.37']      # second
7 ['6.36', '6.78']
6 ['6.33']              # third

Note that the 6 key is split into three separate groups (or fields).

Now suppose I have the equivalent Pandas DataFrame to my dict (same data, same list order and same keys):

  Grp  Nums
0   2  6.20
1   6  6.30
2   6  6.80
3   5  6.45
4   5  6.55
5   6  6.35
6   6  6.37
7   7  6.36
8   7  6.78
9   6  6.33

If I use Pandas' groupby, I do not see how to get this run-by-run iteration. Instead, Pandas groups by key value:

for e in df.groupby('Grp'):
    print(e)

Prints:

('2',   Grp  Nums
0   2  6.20)
('5',   Grp  Nums
3   5  6.45
4   5  6.55)
('6',   Grp  Nums
1   6  6.30            
2   6  6.80                # rows 1-2: first run of 6s
5   6  6.35
6   6  6.37                # rows 5-6: second run
9   6  6.33)               # row 9: third run
('7',   Grp  Nums
7   7  6.36
8   7  6.78)

Note that all of the 6 group keys are bunched together, not kept as separate runs.

My question: Is there an equivalent way to use Pandas' groupby so that 6, for example, would fall into three groups, in the same fashion as Python's groupby?

I tried this:

>>> import numpy as np
>>> df.reset_index().groupby('Grp')['index'].apply(lambda x: np.array(x))
Grp
2                [0]
5             [3, 4]
6    [1, 2, 5, 6, 9]         # I *could* do a second groupby on this...
7             [7, 8]
Name: index, dtype: object

But it is still grouped by the overall Grp key, and I would need to do a second groupby on each np.array to split out the sub-groups of each key, as sketched below.
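For illustration, a minimal sketch of what that second pass could look like, splitting one of those index arrays into runs of consecutive positions (this assumes the default RangeIndex, so consecutive positions mean an uninterrupted run):

from itertools import groupby

import numpy as np

idx = np.array([1, 2, 5, 6, 9])  # the positions of the 6s from above

# Consecutive positions share the same (value - offset) key, so each
# uninterrupted run of 6s comes out as its own sub-list.
runs = [[i for _, i in g]
        for _, g in groupby(enumerate(idx), key=lambda t: t[1] - t[0])]
print(runs)  # [[1, 2], [5, 6], [9]]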

  • Interesting question. What object should be returned in this case, though? I mean, `groupby` is designed to return an object with unique keys, but that's not possible here. How do you want to distinguish between duplicate keys (the different groups of 6s)? – Alex Riley Sep 20 '15 at 19:48
  • Sometimes the uninterrupted run of keys is itself another element of the data. The presence of another key indicates something, such as time intervals or data readings: once you hit a different interval in the series, that is a different field. I want uninterrupted runs of similar keys, just like itertools gives me. – Sep 20 '15 at 19:54
  • There's a method [here](https://groups.google.com/forum/#!msg/pydata/V3p37GRxgNY/Ca3mmXikYUwJ) that could provide a solution. – Alex Riley Sep 20 '15 at 20:02

3 Answers


First you can identify which elements in the Grp column differ from the previous one, then take the cumulative sum to form the groups you need:

In [9]: diff_to_previous = df.Grp != df.Grp.shift(1)

In [10]: diff_to_previous.cumsum()
Out[10]:

0    1
1    2
2    2
3    3
4    3
5    4
6    4
7    5
8    5
9    6

So you can then do

df.groupby(diff_to_previous.cumsum()) 

to get the desired groupby object.
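As a quick check, a minimal sketch iterating that groupby reproduces the itertools-style run-by-run output (the column names and data are taken from the question):

import pandas as pd

df = pd.DataFrame({'Grp': ['2', '6', '6', '5', '5', '6', '6', '7', '7', '6'],
                   'Nums': ['6.20', '6.30', '6.80', '6.45', '6.55',
                            '6.35', '6.37', '6.36', '6.78', '6.33']})

# Label each uninterrupted run: a run starts wherever Grp differs from
# the previous row, and cumsum turns those starts into run ids 1..6.
run_id = (df.Grp != df.Grp.shift(1)).cumsum()

for _, run in df.groupby(run_id):
    print(run.Grp.iloc[0], list(run.Nums))
# 2 ['6.20']
# 6 ['6.30', '6.80']
# 5 ['6.45', '6.55']
# 6 ['6.35', '6.37']
# 7 ['6.36', '6.78']
# 6 ['6.33']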

JoeCondron

Well, not to be cheeky, but why not just use Python's groupby on the DataFrame via iterrows? That is exactly what iterrows is there for:

>>> df
  Grp  Nums
0   2  6.20
1   6  6.30
2   6  6.80
3   5  6.45
4   5  6.55
5   6  6.35
6   6  6.37
7   7  6.36
8   7  6.78
9   6  6.33

>>> from itertools import groupby
>>> for k, l in groupby(df.iterrows(), key=lambda row: row[1]['Grp']):
...     print(k, [t[1]['Nums'] for t in l])

Prints:

2 ['6.20']
6 ['6.30', '6.80']
5 ['6.45', '6.55']
6 ['6.35', '6.37']
7 ['6.36', '6.78']
6 ['6.33']

Trying to make Pandas' groupby act the way you want would probably require so many stacked methods that you won't be able to follow the code when you reread it in the future.
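That said, if the row-by-row overhead of iterrows matters, a variant of the same idea with itertuples should give identical output; it is usually faster since it avoids building a Series per row, and the Grp and Nums attribute names come straight from the column labels:

from itertools import groupby

# Same df as above; itertuples(index=False) yields namedtuples per row.
for k, l in groupby(df.itertuples(index=False), key=lambda row: row.Grp):
    print(k, [row.Nums for row in l])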

dawg

You basically want to create a new column that indexes your desired grouping order, and then group on that: keep the index number the same until the value in Grp changes.

For your data, you would want something like this:

   Grp  Nums new_group
0    2  6.20         1
1    6  6.30         2
2    6  6.80         2
3    5  6.45         3
4    5  6.55         3
5    6  6.35         4
6    6  6.37         4
7    7  6.36         5
8    7  6.78         5
9    6  6.33         6

You can now group on both new_group and Grp:

df.groupby(['new_group', 'Grp']).Nums.groups
{(1, 2): [0],
 (2, 6): [1, 2],
 (3, 5): [3, 4],
 (4, 6): [5, 6],
 (5, 7): [7, 8],
 (6, 6): [9]}

I used this method to create the new column:

df['new_group'] = None
for n, grp in enumerate(df.Grp):
    if n == 0:                      # first row starts group 1
        df.new_group.iat[0] = 1
    elif grp == df.Grp.iat[n - 1]:  # same key as previous row: same group
        df.new_group.iat[n] = df.new_group.iat[n - 1]
    else:                           # key changed: start a new group
        df.new_group.iat[n] = df.new_group.iat[n - 1] + 1

Note that the answer linked in the comments (thanks @ajcr for the link) has the same idea, but in a much more succinct form:

>>> df.groupby((df.Grp != df.Grp.shift()).cumsum()).Nums.groups
{1: [0], 2: [1, 2], 3: [3, 4], 4: [5, 6], 5: [7, 8], 6: [9]}
Alexander
  • Damn! I just read this after posting my answer, which I was very pleased with... I swear it was my own ;P – JoeCondron Sep 20 '15 at 20:37
  • I guess I am just starting in `Pandas`, but how would I get the keys and rows from `df` using `{1: [0], 2: [1, 2], 3: [3, 4], 4: [5, 6], 5: [7, 8], 6: [9]}`? I see how the pairs match, but I am having trouble translating that into a slice or something Pythonic familiar to me... Sorry... – Sep 20 '15 at 22:18
  • `df.groupby(['new_group', 'Grp']).Nums.groups.keys()` gives `dict_keys([(1, 2), (2, 6), (4, 6), (6, 6), (5, 7), (3, 5)])` – Alexander Sep 20 '15 at 22:24
  • `[df.loc[(df.new_group == k1) & (df.Grp == k2), :] for k1, k2 in df.groupby(['new_group', 'Grp']).Nums.groups.keys()]` Note that `groups` is a dictionary with no consistent order, so you probably want to re-sort the resulting list. – Alexander Sep 20 '15 at 22:27
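To spell out the retrieval step discussed in the comments, here is a short sketch (assuming the cumsum-based groupby from above) that walks the runs in order and also fetches one run by its label with get_group:

gb = df.groupby((df.Grp != df.Grp.shift()).cumsum())

# Iterating a groupby yields (key, sub-DataFrame) pairs in group order,
# so the runs come back in their original sequence.
for run_id, rows in gb:
    print(run_id, list(rows.Nums))

# Or pull a single run out by its label (run 2 is the first pair of 6s).
print(gb.get_group(2))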