I'm having trouble with a seemingly very easy operation. What is the most succinct way to get a percent of total from a groupby operation such as `df.groupby('col1').size()`? My DF after grouping looks like this and I just want each group as a percent of total. I remember using a variation of this statement in the past but cannot get it to work now: `percent = totals.div(totals.sum(1), axis=0)`

Original DF:

       A   B   C
    0  77   3  98
    1  77  52  99
    2  77  58  61
    3  77   3  93
    4  77  31  99
    5  77  53  51
    6  77   2   9
    7  72  25  78
    8  34  41  34
    9  44  95  27

Result:

df1.groupby('A').size() / df1.groupby('A').size().sum()

    A
    34    0.1
    44    0.1
    72    0.1
    77    0.7

Here is what I came up with so far, which seems like a reasonably clean way to do this:

df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
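
For what it's worth, pandas has a built-in for exactly this normalization; a minimal sketch on the sample data above, assuming the goal is just each value's share of the column:

    import pandas as pd

    df = pd.DataFrame({'A': [77, 77, 77, 77, 77, 77, 77, 72, 34, 44],
                       'B': [3, 52, 58, 3, 31, 53, 2, 25, 41, 95],
                       'C': [98, 99, 61, 93, 99, 51, 9, 78, 34, 27]})

    # normalize=True makes value_counts return each value's share of the
    # total instead of raw counts; multiply by 100 for a percent
    print(df['A'].value_counts(normalize=True) * 100)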
horatio1701d
  • Are you looking for something like `df.groupby('col1').apply(lambda x: pd.Series.count(x) / len(df))`? – EdChum Mar 27 '15 at 11:03
  • But isn't there a way to divide by sum from groupby operation as opposed to len of DF? Also when I tried that statement I get `unbound method count() must be called with Series instance as first argument (got DataFrame instance instead)` – horatio1701d Mar 27 '15 at 11:33
  • Probably this one: `df.groupby('col1')['col_to_calc'].apply(lambda x: x / x.sum())` – Primer Mar 27 '15 at 11:33
  • It's a little confusing exactly what you are asking here. It would help if you made the example data much smaller and showed exactly what results you wanted. And it would be better to show the data before grouping. – JohnE Mar 27 '15 at 12:00
  • Re-did my question. All I am trying to do is create a groupby for size of occurrence but I want percent of total occurrence instead. There has to be a better way than how I did it in my question. – horatio1701d Mar 27 '15 at 12:22

3 Answers


I don't know if I'm missing something, but it looks like you could do something like this:

df.groupby('A').size() * 100 / len(df)

or

df.groupby('A').size() * 100 / df.shape[0]
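
A minimal runnable sketch of this approach on the question's sample data (only column A is needed for the group sizes):

    import pandas as pd

    df = pd.DataFrame({'A': [77, 77, 77, 77, 77, 77, 77, 72, 34, 44]})

    # size() counts rows per group; dividing by the total row count and
    # scaling by 100 turns the counts into a percent of total
    print(df.groupby('A').size() * 100 / len(df))
    # A
    # 34    10.0
    # 44    10.0
    # 72    10.0
    # 77    70.0
    # dtype: float64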
Roman Pekar
  • that is great but the only issue I believe is that if there were nulls I would have to write something a little verbose like this: `df.dropna(subset=['A']).groupby('A').size() / len(df.dropna(subset=['A']))` – horatio1701d Mar 27 '15 at 17:11

Getting good performance (3.73 s) on a DF with shape (3e6, 59) by using:

    df.groupby('col1').size().apply(lambda x: float(x) / df.groupby('col1').size().sum()*100)
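
Note that this recomputes the groupby inside the apply. A sketch of the same calculation that builds the counts once and normalizes them in a single vectorized division (any speedup on a frame this size is an assumption, not measured here):

    # compute the per-group counts once, then divide by their sum
    # instead of re-running groupby for every element in apply
    counts = df.groupby('col1').size()
    percent = counts / counts.sum() * 100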

horatio1701d

How about:

df = pd.DataFrame({'A': {0: 77, 1: 77, 2: 77, 3: 77, 4: 77, 5: 77, 6: 77, 7: 72, 8: 34, 9: None},
                   'B': {0: 3, 1: 52, 2: 58, 3: 3, 4: 31, 5: 53, 6: 2, 7: 25, 8: 41, 9: 95},
                   'C': {0: 98, 1: 99, 2: 61, 3: 93, 4: 99, 5: 51, 6: 9, 7: 78, 8: 34, 9: 27}})

>>> df.groupby('A').size().divide(sum(df['A'].notnull()))
A
34    0.111111
72    0.111111
77    0.777778
dtype: float64

>>> df
    A   B   C
0  77   3  98
1  77  52  99
2  77  58  61
3  77   3  93
4  77  31  99
5  77  53  51
6  77   2   9
7  72  25  78
8  34  41  34
9 NaN  95  27
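
As an aside, `Series.count()` already excludes nulls, so the denominator above could also be spelled without a Python-level `sum`; a small variant of the same idea:

    # df['A'].count() is the number of non-null entries in 'A',
    # matching sum(df['A'].notnull()) but computed by pandas directly
    df.groupby('A').size() / df['A'].count()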
Alexander
  • thanks but that's nearly 2X slower when I time it. Thanks for all of the help. I get `1 loops, best of 3: 3.73 s per loop` for the lambda version and `1 loops, best of 3: 7.29 s per loop` for using divide. – horatio1701d Mar 27 '15 at 17:57
  • Really? Which lambda are you referring to, and how big is your source data? Note that the second method below is faster, but mileage may vary depending on data: `%%timeit df.groupby('A').size().apply(lambda x: float(x) / df.groupby('A').size().sum()*100)` gives 1000 loops, best of 3: 1.82 ms per loop; `%%timeit df.groupby('A').size().divide(sum(df['A'].notnull()))` gives 1000 loops, best of 3: 555 µs per loop – Alexander Mar 27 '15 at 18:11
  • Here is what I get on my DF which is 3M rows and 59 columns: `df.groupby('A').size().apply(lambda x: float(x) / df.groupby('A').size().sum()) result: 1 loops, best of 3: 3.73 s per loop` and using `df.groupby('A').size().divide(sum(df['A'].notnull()))` I get `1 loops, best of 3: 7.29 s per loop` – horatio1701d Mar 27 '15 at 18:49