2

I am trying to do a groupby on a dataframe where I apply value_counts(normalize=True) and value_counts(normalize=False) on it at the same time using .agg.

However, I cannot find a way to do this without it throwing an error. I have tried multiple methods here: Multiple aggregations of the same column using pandas GroupBy.agg() but none seem to work for me. A part of the issue for me is having to pass normalize to value_counts.

I have created a test example like using this:

example = pd.DataFrame({'A': ['a','a','a','b','b','c'], 'B':[1,1,2,3,3,4]})

which gives me:

+---+---+---+
|   | A | B |
+---+---+---+
| 0 | a | 1 |
| 1 | a | 1 |
| 2 | a | 2 |
| 3 | b | 3 |
| 4 | b | 3 |
| 5 | c | 4 |
+---+---+---+

and I want to return:

A   B   False   True
a   1   2   0.666
    2   1   0.333
b   3   2   1.000
c   3   1   1.000

Doing something like:

example.groupby('A')['B'].value_counts(normalize=True)

gives me half of what I want, but I can never get the .agg to work

Thanks

Andrew
  • 539
  • 5
  • 20

1 Answers1

1

Here agg isn't great because pd.Series.value_counts returns a Series and to get the normalized result it requires an additional level of aggregation. Either concat the different value_counts or manually calculate the percent after the first groupby.

pd.concat([df.groupby('A').B.value_counts().rename('N'),
           df.groupby('A').B.value_counts(normalize=True).rename('pct')], axis=1)

# or 
res = df.groupby('A').B.value_counts().rename('N')
res = pd.concat([res, (res/res.groupby(level='A').transform('sum')).rename('pct')], axis=1)

     N       pct
A B             
a 1  2  0.666667
  2  1  0.333333
b 3  2  1.000000
c 4  1  1.000000
ALollz
  • 57,915
  • 7
  • 66
  • 89