
Suppose I have a Python dict of lists like so:

di = {'Grp':  ['2',    '6',    '6',    '5',    '5',    '6',    '6',    '7',    '7',    '6'],
      'Nums': ['6.20', '6.30', '6.80', '6.45', '6.55', '6.35', '6.37', '6.36', '6.78', '6.33']}

I can easily group the numbers with their group key using itertools.groupby:

from itertools import groupby

for k, l in groupby(zip(di['Grp'], di['Nums']), key=lambda t: t[0]):
    print(k, [t[1] for t in l])

Prints:

2 ['6.20']
6 ['6.30', '6.80']      # one field, key=6
5 ['6.45', '6.55']
6 ['6.35', '6.37']      # second
7 ['6.36', '6.78']
6 ['6.33']              # third

Note that the 6 key is split into three separate groups (or fields).

Now suppose I have the equivalent Pandas DataFrame to my dict (same data, same list order and same keys):

  Grp  Nums
0   2  6.20
1   6  6.30
2   6  6.80
3   5  6.45
4   5  6.55
5   6  6.35
6   6  6.37
7   7  6.36
8   7  6.78
9   6  6.33

If I use Pandas' groupby, I do not see how to get this run-by-run iteration. Instead, Pandas groups by key value:

for e in df.groupby('Grp'):
    print(e)

Prints:

('2',   Grp  Nums
0   2  6.20)
('5',   Grp  Nums
3   5  6.45
4   5  6.55)
('6',   Grp  Nums
1   6  6.30            
2   6  6.80                # rows 1-2: first run of 6s
5   6  6.35
6   6  6.37                # rows 5-6: second run
9   6  6.33)               # row 9: third run
('7',   Grp  Nums
7   7  6.36
8   7  6.78)

Note that all of the 6 group keys are bunched together, not kept as separate runs.

My question: Is there an equivalent way to use Pandas' groupby so that 6, for example, would fall into three groups, in the same fashion as Python's groupby?

I tried this:

>>> import numpy as np
>>> df.reset_index().groupby('Grp')['index'].apply(lambda x: np.array(x))
Grp
2                [0]
5             [3, 4]
6    [1, 2, 5, 6, 9]         # I *could* do a second groupby on this...
7             [7, 8]
Name: index, dtype: object

But it is still grouped by the overall Grp key, and I would need to do a second groupby on each np.array to split out the sub-groups of each key, as sketched below.
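For illustration, a minimal sketch of what that second pass could look like, splitting one of those index arrays into runs of consecutive positions (this assumes the default RangeIndex, so consecutive positions mean an uninterrupted run):

from itertools import groupby

import numpy as np

idx = np.array([1, 2, 5, 6, 9])  # the positions of the 6s from above

# Consecutive positions share the same (value - offset) key, so each
# uninterrupted run of 6s comes out as its own sub-list.
runs = [[i for _, i in g]
        for _, g in groupby(enumerate(idx), key=lambda t: t[1] - t[0])]
print(runs)  # [[1, 2], [5, 6], [9]]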

  • Interesting question. What object should be returned in this case, though? I mean, `groupby` is designed to return an object with unique keys, but that's not possible here. How do you want to distinguish between duplicate keys (the different groups of 6s)? – Alex Riley Sep 20 '15 at 19:48
  • Sometimes the uninterrupted run of keys is itself another element of the data. The presence of another key indicates something, such as time intervals or data readings: once you hit a different interval in the series, that is a different field. I want uninterrupted runs of similar keys, just like itertools gives me. – Sep 20 '15 at 19:54
  • There's a method [here](https://groups.google.com/forum/#!msg/pydata/V3p37GRxgNY/Ca3mmXikYUwJ) that could provide a solution. – Alex Riley Sep 20 '15 at 20:02

3 Answers


First you can identify which elements in the Grp column differ from the previous one, then take the cumulative sum to form the groups you need:

In [9]: diff_to_previous = df.Grp != df.Grp.shift(1)

In [10]: diff_to_previous.cumsum()
Out[10]:

0    1
1    2
2    2
3    3
4    3
5    4
6    4
7    5
8    5
9    6

So you can then do

df.groupby(diff_to_previous.cumsum()) 

to get the desired groupby object.
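As a quick check, a minimal sketch iterating that groupby reproduces the itertools-style run-by-run output (the column names and data are taken from the question):

import pandas as pd

df = pd.DataFrame({'Grp': ['2', '6', '6', '5', '5', '6', '6', '7', '7', '6'],
                   'Nums': ['6.20', '6.30', '6.80', '6.45', '6.55',
                            '6.35', '6.37', '6.36', '6.78', '6.33']})

# Label each uninterrupted run: a run starts wherever Grp differs from
# the previous row, and cumsum turns those starts into run ids 1..6.
run_id = (df.Grp != df.Grp.shift(1)).cumsum()

for _, run in df.groupby(run_id):
    print(run.Grp.iloc[0], list(run.Nums))
# 2 ['6.20']
# 6 ['6.30', '6.80']
# 5 ['6.45', '6.55']
# 6 ['6.35', '6.37']
# 7 ['6.36', '6.78']
# 6 ['6.33']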

JoeCondron

Well, not to be cheeky, but why not just use Python's groupby on the DataFrame via iterrows? That is exactly what iterrows is there for:

>>> df
  Grp  Nums
0   2  6.20
1   6  6.30
2   6  6.80
3   5  6.45
4   5  6.55
5   6  6.35
6   6  6.37
7   7  6.36
8   7  6.78
9   6  6.33

>>> from itertools import groupby
>>> for k, l in groupby(df.iterrows(), key=lambda row: row[1]['Grp']):
...     print(k, [t[1]['Nums'] for t in l])

Prints:

2 ['6.20']
6 ['6.30', '6.80']
5 ['6.45', '6.55']
6 ['6.35', '6.37']
7 ['6.36', '6.78']
6 ['6.33']

Trying to make Pandas' groupby act the way you want would probably require so many stacked methods that you won't be able to follow the code when you reread it in the future.
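That said, if the row-by-row overhead of iterrows matters, a variant of the same idea with itertuples should give identical output; it is usually faster since it avoids building a Series per row, and the Grp and Nums attribute names come straight from the column labels:

from itertools import groupby

# Same df as above; itertuples(index=False) yields namedtuples per row.
for k, l in groupby(df.itertuples(index=False), key=lambda row: row.Grp):
    print(k, [row.Nums for row in l])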

dawg

You basically want to create a new column that indexes your desired grouping order, and then group on that: keep the index number the same until the value in Grp changes.

For your data, you would want something like this:

   Grp  Nums new_group
0    2  6.20         1
1    6  6.30         2
2    6  6.80         2
3    5  6.45         3
4    5  6.55         3
5    6  6.35         4
6    6  6.37         4
7    7  6.36         5
8    7  6.78         5
9    6  6.33         6

You can now group on both new_group and Grp:

df.groupby(['new_group', 'Grp']).Nums.groups
{(1, 2): [0],
 (2, 6): [1, 2],
 (3, 5): [3, 4],
 (4, 6): [5, 6],
 (5, 7): [7, 8],
 (6, 6): [9]}

I used this method to create the new column:

df['new_group'] = None
for n, grp in enumerate(df.Grp):
    if n == 0:                      # first row starts group 1
        df.new_group.iat[0] = 1
    elif grp == df.Grp.iat[n - 1]:  # same key as previous row: same group
        df.new_group.iat[n] = df.new_group.iat[n - 1]
    else:                           # key changed: start a new group
        df.new_group.iat[n] = df.new_group.iat[n - 1] + 1

Note that the answer linked in the comments (thanks @ajcr for the link) has the same idea, but in a much more succinct form:

>>> df.groupby((df.Grp != df.Grp.shift()).cumsum()).Nums.groups
{1: [0], 2: [1, 2], 3: [3, 4], 4: [5, 6], 5: [7, 8], 6: [9]}
Alexander
  • Damn! I just read this after posting my answer, which I was very pleased with... I swear it was my own ;P – JoeCondron Sep 20 '15 at 20:37
  • I guess I am just starting in `Pandas`, but how would I get the keys and rows from `df` using `{1: [0], 2: [1, 2], 3: [3, 4], 4: [5, 6], 5: [7, 8], 6: [9]}`? I see how the pairs match, but I am having trouble translating that into a slice or something Pythonic familiar to me... Sorry... – Sep 20 '15 at 22:18
  • `df.groupby(['new_group', 'Grp']).Nums.groups.keys()` gives `dict_keys([(1, 2), (2, 6), (4, 6), (6, 6), (5, 7), (3, 5)])` – Alexander Sep 20 '15 at 22:24
  • `[df.loc[(df.new_group == k1) & (df.Grp == k2), :] for k1, k2 in df.groupby(['new_group', 'Grp']).Nums.groups.keys()]` Note that `groups` is a dictionary with no consistent order, so you probably want to re-sort the resulting list. – Alexander Sep 20 '15 at 22:27
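To spell out the retrieval step discussed in the comments, here is a short sketch (assuming the cumsum-based groupby from above) that walks the runs in order and also fetches one run by its label with get_group:

gb = df.groupby((df.Grp != df.Grp.shift()).cumsum())

# Iterating a groupby yields (key, sub-DataFrame) pairs in group order,
# so the runs come back in their original sequence.
for run_id, rows in gb:
    print(run_id, list(rows.Nums))

# Or pull a single run out by its label (run 2 is the first pair of 6s).
print(gb.get_group(2))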