-1

How can I calculate a group-wise percentage in pandas?

similar to Pandas: .groupby().size() and percentages or Pandas Very Simple Percent of total size from Group by I want to calculate the percentage of a value per group.

How can I achieve this?

My dataset is structured like

ClassLabel, Field

Initially, I aggregate on both ClassLbel and Field like

grouped = mydf.groupby(['Field', 'ClassLabel']).size().reset_index()
grouped = grouped.rename(columns={0: 'customersCountPerGroup'})

Now I would like to know the percentage of customers in each group on a per group basis. The groups total can be obtained like mydf.groupby(['Field']).size() but I neither can merge that as a column nor am I sure this is the right approach - there must be something simpler.

edit

I want to calculate the percentage only based on a single group e.g. 3 0 0.125 1 0.250 the sum of 0 + 1 --> 0.125 + 0.250 = 0,375 and use this value to devide / normalize grouped and not grouped.sum() enter image description here

Community
  • 1
  • 1
Georg Heiler
  • 16,916
  • 36
  • 162
  • 292

1 Answers1

4

IIUC you can use:

mydf = pd.DataFrame({'Field':[1,1,3,3,3],
                   'ClassLabel':[4,4,4,4,4],
                   'A':[7,8,9,5,7]})

print (mydf)
   A  ClassLabel  Field
0  7           4      1
1  8           4      1
2  9           4      3
3  5           4      3
4  7           4      3

grouped = mydf.groupby(['Field', 'ClassLabel']).size()
print (grouped)
Field  ClassLabel
1      4             2
3      4             3
dtype: int64

print (100 * grouped / grouped.sum())
Field  ClassLabel
1      4             40.0
3      4             60.0
dtype: float64

grouped = mydf.groupby(['Field', 'ClassLabel']).size().reset_index()
grouped = grouped.rename(columns={0: 'customersCountPerGroup'})
print (grouped)
   Field  ClassLabel  customersCountPerGroup
0      1           4                       2
1      3           4                       3

grouped['per'] = 100 * grouped.customersCountPerGroup / grouped.customersCountPerGroup.sum()
print (grouped)
   Field  ClassLabel  customersCountPerGroup   per
0      1           4                       2  40.0
1      3           4                       3  60.0

EDIT by comment:

mydf = pd.DataFrame({'Field':[1,1,3,3,3,4,5,6],
                   'ClassLabel':[0,0,0,1,1,0,0,6],
                   'A':[7,8,9,5,7,5,6,4]})

print (mydf)

grouped = mydf.groupby(['Field', 'ClassLabel']).size()
df =  grouped / grouped.sum()

df = (grouped / df.groupby(level=0).transform('sum')).reset_index(name='new')
print (df)
   Field  ClassLabel       new
0      1           0  8.000000
1      3           0  2.666667
2      3           1  5.333333
3      4           0  8.000000
4      5           0  8.000000
5      6           6  8.000000
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • But if I do not want to use `grouped.sum()` but really perform this operation for each group separately e.g. use the sum of class 0 and class 1 for field 1 as the total. How can I access that? – Georg Heiler Sep 08 '16 at 12:47
  • Sorry, can you add it to question, what is desired output from input? the best is modify or use my sample or create another one, because I am not sure if I understand you. – jezrael Sep 08 '16 at 12:55
  • ok, i add solution, but I am not still sure, if I understand you. please check it. – jezrael Sep 08 '16 at 13:12
  • Thanks. Indeed this is what I wanted. Thanks a lot. How could I add the column created by `grouped / sum_grouped` to the df created by `.reset_index()`? I only get incompatible index of inserted column – Georg Heiler Sep 08 '16 at 13:21
  • Try `.reset_index(name='new')` – jezrael Sep 08 '16 at 13:21
  • 1
    `df = (mydf.groupby(['Field', 'ClassLabel']).size() / df.groupby(level=0).transform('sum')).reset_index(name='new') ` – jezrael Sep 08 '16 at 13:23
  • Indeed this is seems to work but it seems to be a very time consuming operation compared to the statements above which were instant. – Georg Heiler Sep 08 '16 at 13:30
  • I simplify it, now it is faster. – jezrael Sep 08 '16 at 13:43