
I have a pandas DataFrame. I want to group it by one combination of columns and count distinct values of another combination of columns.

For example I have the following data frame:

   a   b    c     d      e
0  1  10  100  1000  10000
1  1  10  100  1000  20000
2  1  20  100  1000  20000
3  1  20  100  2000  20000

I can group it by columns a and b and count distinct values in the column d:

df.groupby(['a','b'])['d'].nunique().reset_index()

As a result I get:

   a   b  d
0  1  10  1
1  1  20  2

However, I would like to count distinct values in a combination of columns. For example, if I use c and d, then in the first group there is only one unique combination, (100, 1000), while in the second group there are two distinct combinations: (100, 1000) and (100, 2000).

The following naive "generalization" does not work:

df.groupby(['a','b'])[['c','d']].nunique().reset_index()

because nunique() does not generalize this way: it counts distinct values in each column separately, not distinct combinations of columns.
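
A minimal check, assuming a pandas version (0.20 or newer) where DataFrameGroupBy.nunique exists: the call runs, but it gives per-column counts rather than combination counts:

print(df.groupby(['a','b'])[['c','d']].nunique())

      c  d
a b       
1 10  1  1
  20  1  2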

Roman
  • could you not simply create a column `f` as a combination of `c` and `d`? – Nils Gudat Nov 25 '16 at 13:09
  • @NilsGudat, yes, I thought about this approach but I was not sure it is the correct way to go. My concern: I have numeric values in both columns, so how should I construct unique values out of them? I could convert them to strings and then concatenate, but it might be too slow. – Roman Nov 25 '16 at 13:11
  • None of the answers worked for you? ;) – IanS Nov 28 '16 at 17:56

4 Answers


You can create a combined key by converting the values to strings into a new column e and then use SeriesGroupBy.nunique (note this overwrites the sample's existing column e; in practice pick an unused name):

df['e'] = df.c.astype(str) + df.d.astype(str)
df = df.groupby(['a','b'])['e'].nunique().reset_index()
print (df)
   a   b  e
0  1  10  1
1  1  20  2
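
One caveat, not in the original answer: bare concatenation can make distinct pairs collide, e.g. c=1, d=10 and c=11, d=0 both become '110'. A separator keeps them distinct:

df['e'] = df.c.astype(str) + '_' + df.d.astype(str)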

You can also use a Series without creating a new column (run against the original frame, since the snippet above reassigned df):

df = (df.c.astype(str) + df.d.astype(str)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2

Another possible solution is to create tuples (again starting from the original frame):

df = (df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique().reset_index(name='f')
print (df)
   a   b  f
0  1  10  1
1  1  20  2

Another NumPy solution, adapted from this answer:

import numpy as np

def f(x):
    # view each row as one opaque (void) item so np.unique counts distinct rows
    a = np.ascontiguousarray(x.values)
    void_rows = a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    return len(np.unique(void_rows))

print (df.groupby(['a','b'])[['c','d']].apply(f))
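
Against the original sample frame this returns the same counts as the approaches above:

a  b 
1  10    1
   20    2
dtype: int64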

Timings:

import pandas as pd

#[1000000 rows x 5 columns]
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30, size=(N, 5)))
df.columns = list('abcde')
print (df)

In [354]: %timeit (df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum()))
1 loop, best of 3: 663 ms per loop

In [355]: %timeit (df.groupby(['a','b'])[['c','d']].apply(f))
1 loop, best of 3: 387 ms per loop

In [356]: %timeit (df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size())
1 loop, best of 3: 441 ms per loop

In [357]: %timeit ((df.c.astype(str)+df.d.astype(str)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 4.95 s per loop

In [358]: %timeit ((df[['c','d']].apply(tuple, axis=1)).groupby([df.a, df.b]).nunique())
1 loop, best of 3: 17.6 s per loop
jezrael

Don't stop the groupby at 'a' and 'b'; include everything you're looking at.

df.groupby(['a', 'b', 'c', 'd']).size()

a  b   c    d   
1  10  100  1000    2
   20  100  1000    1
            2000    1
dtype: int64

unstack to get a different view

df.groupby(['a', 'b', 'c', 'd']).size().unstack(fill_value=0)

d         1000  2000
a b  c              
1 10 100     2     0
  20 100     1     1


to actually get at the information you're looking for

df.groupby(['a', 'b', 'c', 'd']).size().groupby(level=['a', 'b']).size()

a  b 
1  10    1
   20    2
dtype: int64

This says:
within the group where a is 1 and b is 10, there is 1 unique combination of c and d;
within the group where a is 1 and b is 20, there are 2 unique combinations of c and d.
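
The same count can also be had by dropping duplicate rows first and then sizing the groups; a sketch of that equivalent formulation (not from the original answer):

df.drop_duplicates(['a', 'b', 'c', 'd']).groupby(['a', 'b']).size()

a  b 
1  10    1
   20    2
dtype: int64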

piRSquared

If you do not want to concatenate columns, you can apply a function that counts the number of non-duplicates:

# unique rows per group = group size minus number of duplicated rows
df.groupby(['a','b'])[['c','d']].apply(lambda g: len(g) - g.duplicated().sum())
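
For the sample frame this produces the same counts as the other answers:

a  b 
1  10    1
   20    2
dtype: int64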
IanS

You can extend your original concept...

df.groupby(['a', 'b', 'c']).d.nunique()

a  b   c  
1  10  100    1
   20  100    2
Name: d, dtype: int64

You can drop the c level from the index and focus on just the information you want.

df.groupby(['a', 'b', 'c']).d.nunique().reset_index('c', drop=True)

a  b 
1  10    1
   20    2
Name: d, dtype: int64
  • I think in your example you calculate unique values in just one column (d), and I want to be able to calculate unique combinations of values from several columns. So my original example is more general. – Roman Nov 25 '16 at 14:20
  • That's the thing: using a groupby that includes one of your columns and calculating `nunique` on the other one does exactly that. –  Nov 25 '16 at 14:30
  • what if for `a, b, c = 1, 10, 100` I have 2 unique values of d and for `a, b, c = 1, 10, 200` I have 3 unique values of d? Now I should group by `a and b` again and sum up 2 and 3 to find out that for `a, b = 1, 10` I have 5 unique combinations of c and d. – Roman Nov 25 '16 at 15:10
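
Following up on that last comment: the per-(a, b, c) counts can indeed be summed back to the (a, b) level. A sketch of that two-step version (not from the original answer):

df.groupby(['a', 'b', 'c']).d.nunique().groupby(level=['a', 'b']).sum()

a  b 
1  10    1
   20    2
Name: d, dtype: int64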