23

In pandas, how can I add a new column which enumerates rows based on a given grouping?

For instance, assume the following DataFrame:

import pandas as pd
import numpy as np

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
  col_a  col_b
0     A      0
1     B      1
2     C      2
3     A      3
4     A      4
5     C      5
6     B      6
7     B      7
8     A      8
9     C      9

I'd like to add a col_c that gives me the Nth row of the "group" based on a grouping of col_a and sorting of col_b.

Desired output:

  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3

I'm struggling to get to col_c. You can get to the proper grouping and sorting with .sort_index(by=['col_a', 'col_b']), it's now a matter of getting to that new column and labeling each row.

cottontail
  • 10,268
  • 18
  • 50
  • 51
Greg Reda
  • 1,744
  • 2
  • 13
  • 20

3 Answers3

40

There's cumcount, for precisely this case:

df['col_c'] = g.cumcount()

As it says in the docs:

Number each item in each group from 0 to the length of that group - 1.


Original answer (before cumcount was defined).

You could create a helper function to do this:

def add_col_c(x):
    x['col_c'] = np.arange(len(x))
    return x

First sort by column col_a:

In [11]: df.sort('col_a', inplace=True)

then apply this function across each group:

In [12]: g = df.groupby('col_a', as_index=False)

In [13]: g.apply(add_col_c)
Out[13]:
  col_a  col_b  col_c
3     A      3      0
8     A      8      1
0     A      0      2
4     A      4      3
6     B      6      0
1     B      1      1
7     B      7      2
9     C      9      0
2     C      2      1
5     C      5      2

In order to get 1,2,... you couls use np.arange(1, len(x) + 1).

Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • This basically answers the question -- though gjreda asked to enumerate col_a sorted by col_b. – gabe Jun 21 '13 at 14:17
  • Awesome, thanks @AndyHayden. Knew it was going to be a groupby, but couldn't figure out how to apply it properly. – Greg Reda Jun 21 '13 at 23:43
4

The given answers both involve calling a python function for each group, and if you have many groups a vectorized approach should be faster (I havent checked).

Here is my pure numpy suggestion:

In [5]: df.sort(['col_a', 'col_b'], inplace=True, ascending=(False, False))
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print df
  col_a  col_b  col_c
9     C      9      0
5     C      5      1
2     C      2      2
7     B      7      0
6     B      6      1
1     B      1      2
8     A      8      0
4     A      4      1
3     A      3      2
0     A      0      3
andrew
  • 1,843
  • 20
  • 19
2

You could define your own function to deal with that:

In [58]: def func(x):
   ....:     x['col_c'] = x['col_a'].argsort() + 1 
   ....:     return x
   ....: 

In [59]: df.groupby('col_a').apply(func)
Out[59]: 
  col_a  col_b  col_c
0     A      0      1   
3     A      3      2   
4     A      4      3   
8     A      8      4   
1     B      1      1   
6     B      6      2   
7     B      7      3   
2     C      2      1   
5     C      5      2   
9     C      9      3
waitingkuo
  • 89,478
  • 28
  • 112
  • 118