23

I have a pandas DataFrame and group it by two columns (for example, col1 and col2). For fixed values of col1 and col2 (i.e. for a given group) there can be several different values in col3. I would like to count the number of distinct values in the third column for each group.

For example, if I have this as my input:

1  1  1
1  1  1
1  1  2
1  2  3
1  2  3
1  2  3
2  1  1
2  1  2
2  1  3
2  2  3
2  2  3
2  2  3

I would like to have this table (data frame) as the output:

1  1  2
1  2  1
2  1  3
2  2  1
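
For reference, a minimal sketch of how this input could be constructed as a DataFrame (the column names col1, col2 and col3 are assumed from the description above):

import pandas as pd

# Each row is one (col1, col2, col3) record from the input table above.
df = pd.DataFrame([[1, 1, 1], [1, 1, 1], [1, 1, 2],
                   [1, 2, 3], [1, 2, 3], [1, 2, 3],
                   [2, 1, 1], [2, 1, 2], [2, 1, 3],
                   [2, 2, 3], [2, 2, 3], [2, 2, 3]],
                  columns=['col1', 'col2', 'col3'])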
Roman
  • A related topic is [.value_counts()](https://stackoverflow.com/questions/20076195/what-is-the-most-efficient-way-of-counting-occurrences-in-pandas#20076611) – cardamom Jun 19 '17 at 15:33

2 Answers

27
df.groupby(['col1','col2'])['col3'].nunique().reset_index()
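
A minimal, self-contained sketch of this one-liner applied to the question's data (the column names col1, col2 and col3 are assumed from the question):

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'col2': [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
                   'col3': [1, 1, 2, 3, 3, 3, 1, 2, 3, 3, 3, 3]})

# nunique counts the distinct col3 values within each (col1, col2) group;
# reset_index turns the grouped result back into a flat DataFrame.
out = df.groupby(['col1', 'col2'])['col3'].nunique().reset_index()
print(out)
# Expected result (matching the question's desired output):
#    col1  col2  col3
# 0     1     1     2
# 1     1     2     1
# 2     2     1     3
# 3     2     2     1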
Roman
  • Interestingly, `nunique` seems twice as slow as Jeff's answer. – Andy Hayden Jul 29 '13 at 14:31
  • Weird! I am seeing that also. Groupby may be taking the wrong so-called path here -- the logic that applies functions to groups is pretty dense. – Dan Allan Jul 29 '13 at 15:52
  • There is more overhead in calling ``value_counts`` (which has to reconstruct the Series) on each group than in calling ``unique`` (which just returns an ndarray). This can actually be non-trivial. If you don't need the indexes inside the function, you can often avoid this penalty by not instantiating the Series, which ``value_counts`` does and then discards, because all you need is its length. – Jeff Jul 29 '13 at 16:41
  • Using `nunique()` seems to be optimal in pandas 0.18.0. – DACW Apr 21 '16 at 14:26
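
As a rough illustration of the per-group overhead discussed in the comments above, here is a minimal sketch (the data and variable names are assumptions, not taken from the original posts) of the two computations, which can be compared with %timeit in IPython:

import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'col2': [1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2, 2],
                   'col3': [1, 1, 2, 3, 3, 3, 1, 2, 3, 3, 3, 3]})

g = df.groupby(['col1', 'col2'])['col3']

# unique() returns a plain ndarray per group, so taking its length is cheap.
counts_via_unique = g.apply(lambda x: len(x.unique()))

# value_counts() builds an indexed Series per group that is discarded
# as soon as its length is taken, which adds avoidable overhead.
counts_via_value_counts = g.apply(lambda x: len(x.value_counts()))

# Both give the same per-group counts; only their cost differs.
assert counts_via_unique.equals(counts_via_value_counts)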
21
In [17]: df
Out[17]: 
    0  1  2
0   1  1  1
1   1  1  1
2   1  1  2
3   1  2  3
4   1  2  3
5   1  2  3
6   2  1  1
7   2  1  2
8   2  1  3
9   2  2  3
10  2  2  3
11  2  2  3

In [19]: df.groupby([0,1])[2].apply(lambda x: len(x.unique()))
Out[19]: 
0  1
1  1    2
   2    1
2  1    3
   2    1
dtype: int64
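
As a follow-up, the grouped Series above can be flattened back into a three-column table like the one the question asks for; the column name n_unique below is illustrative, not part of the original answer:

import pandas as pd

# Rebuild the same frame with integer column labels 0, 1 and 2.
df = pd.DataFrame([[1, 1, 1], [1, 1, 1], [1, 1, 2],
                   [1, 2, 3], [1, 2, 3], [1, 2, 3],
                   [2, 1, 1], [2, 1, 2], [2, 1, 3],
                   [2, 2, 3], [2, 2, 3], [2, 2, 3]])

out = df.groupby([0, 1])[2].apply(lambda x: len(x.unique())).reset_index(name='n_unique')
print(out)
# One row per (0, 1) group: (1,1)->2, (1,2)->1, (2,1)->3, (2,2)->1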
Jeff