I'm having trouble with a seemingly very easy operation. What is the most succinct way to get a percent of total from a groupby operation such as `df.groupby('col1').size()`? My DF after grouping looks like this and I just want each group as a percent of total. I remember using a variation of this statement in the past but cannot get it to work now: `percent = totals.div(totals.sum(1), axis=0)`

Original DF:

       A   B   C
    0  77   3  98
    1  77  52  99
    2  77  58  61
    3  77   3  93
    4  77  31  99
    5  77  53  51
    6  77   2   9
    7  72  25  78
    8  34  41  34
    9  44  95  27

Result:

df1.groupby('A').size() / df1.groupby('A').size().sum()

    A
    34    0.1
    44    0.1
    72    0.1
    77    0.7

Here is what I came up with so far, which seems like a reasonably clean way to do this:

df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
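
For what it's worth, pandas has a built-in for exactly this normalization; a minimal sketch on the sample data above, assuming the goal is just each value's share of the column:

    import pandas as pd

    df = pd.DataFrame({'A': [77, 77, 77, 77, 77, 77, 77, 72, 34, 44],
                       'B': [3, 52, 58, 3, 31, 53, 2, 25, 41, 95],
                       'C': [98, 99, 61, 93, 99, 51, 9, 78, 34, 27]})

    # normalize=True makes value_counts return each value's share of the
    # total instead of raw counts; multiply by 100 for a percent
    print(df['A'].value_counts(normalize=True) * 100)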
horatio1701d
  • Are you looking for something like `df.groupby('col1').apply(lambda x: pd.Series.count(x) / len(df))`? – EdChum Mar 27 '15 at 11:03
  • But isn't there a way to divide by sum from groupby operation as opposed to len of DF? Also when I tried that statement I get `unbound method count() must be called with Series instance as first argument (got DataFrame instance instead)` – horatio1701d Mar 27 '15 at 11:33
  • Probably this one: `df.groupby('col1')['col_to_calc'].apply(lambda x: x / x.sum())` – Primer Mar 27 '15 at 11:33
  • It's a little confusing exactly what you are asking here. It would help if you made the example data much smaller and showed exactly what results you wanted. And it would be better to show the data before grouping. – JohnE Mar 27 '15 at 12:00
  • Re-did my question. All I am trying to do is create a groupby for size of occurrence but I want percent of total occurrence instead. There has to be a better way than how I did it in my question. – horatio1701d Mar 27 '15 at 12:22

3 Answers


I don't know if I'm missing something, but it looks like you could do something like this:

df.groupby('A').size() * 100 / len(df)

or

df.groupby('A').size() * 100 / df.shape[0]
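
A minimal runnable sketch of this approach on the question's sample data (only column A is needed for the group sizes):

    import pandas as pd

    df = pd.DataFrame({'A': [77, 77, 77, 77, 77, 77, 77, 72, 34, 44]})

    # size() counts rows per group; dividing by the total row count and
    # scaling by 100 turns the counts into a percent of total
    print(df.groupby('A').size() * 100 / len(df))
    # A
    # 34    10.0
    # 44    10.0
    # 72    10.0
    # 77    70.0
    # dtype: float64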
Roman Pekar
  • that is great but the only issue I believe is that if there were nulls I would have to write something a little verbose like this: `df.dropna(subset=['A']).groupby('A').size() / len(df.dropna(subset=['A']))` – horatio1701d Mar 27 '15 at 17:11

Getting good performance (3.73 s) on a DF with shape (3e6, 59) by using:

    df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
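
Note that this recomputes the groupby inside the apply. A sketch of the same calculation that builds the counts once and normalizes them in a single vectorized division (any speedup on a frame this size is an assumption, not measured here):

    # compute the per-group counts once, then divide by their sum
    # instead of re-running groupby for every element in apply
    counts = df.groupby('col1').size()
    percent = counts / counts.sum() * 100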

horatio1701d

How about:

df = pd.DataFrame({'A': {0: 77, 1: 77, 2: 77, 3: 77, 4: 77, 5: 77, 6: 77, 7: 72, 8: 34, 9: None},
                   'B': {0: 3, 1: 52, 2: 58, 3: 3, 4: 31, 5: 53, 6: 2, 7: 25, 8: 41, 9: 95},
                   'C': {0: 98, 1: 99, 2: 61, 3: 93, 4: 99, 5: 51, 6: 9, 7: 78, 8: 34, 9: 27}})

>>> df.groupby('A').size().divide(sum(df['A'].notnull()))
A
34    0.111111
72    0.111111
77    0.777778
dtype: float64

>>> df
    A   B   C
0  77   3  98
1  77  52  99
2  77  58  61
3  77   3  93
4  77  31  99
5  77  53  51
6  77   2   9
7  72  25  78
8  34  41  34
9 NaN  95  27
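
As an aside, `Series.count()` already excludes nulls, so the denominator above could also be spelled without a Python-level `sum`; a small variant of the same idea:

    # df['A'].count() is the number of non-null entries in 'A',
    # matching sum(df['A'].notnull()) but computed by pandas directly
    df.groupby('A').size() / df['A'].count()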
Alexander
  • thanks but that's nearly 2X slower when I time it. Thanks for all of the help. I get `1 loops, best of 3: 3.73 s per loop` for the lambda version and `1 loops, best of 3: 7.29 s per loop` for using divide. – horatio1701d Mar 27 '15 at 17:57
  • Really? Which lambda are you referring to, and how big is your source data? Note that the second method below is faster, but mileage may vary depending on data: `%%timeit df.groupby('A').size().apply(lambda x: float(x) / df.groupby('A').size().sum()*100)` gives 1000 loops, best of 3: 1.82 ms per loop; `%%timeit df.groupby('A').size().divide(sum(df['A'].notnull()))` gives 1000 loops, best of 3: 555 µs per loop – Alexander Mar 27 '15 at 18:11
  • Here is what I get on my DF which is 3M rows and 59 columns: `df.groupby('A').size().apply(lambda x: float(x) / df.groupby('A').size().sum()) result: 1 loops, best of 3: 3.73 s per loop` and using `df.groupby('A').size().divide(sum(df['A'].notnull()))` I get `1 loops, best of 3: 7.29 s per loop` – horatio1701d Mar 27 '15 at 18:49