Assistance creating calculated columns with Pandas Groupby

Question

I have a pandas Dataframe, called relevant_URT_data, that looks like this:

test MRN Number ResCat
12 AP Disbursements Payment Details URT INC3778700 0
33 AP Disbursements Payment Details NaN INC3783080 0
72 AP Disbursements Payment Details URT INC3782671 0
150 AP Quality Assurance Payment Status URT INC3778770 0
178 PR HBS Inquiry NaN INC3776742 1
192 AP Quality Assurance Payment Status NaN INC3778547 0
315 AP Quality Assurance Payment Status URT INC3780548 0
328 PR Accounting W-2 Form URT INC3782016 0
355 AP General Submit Invoice for Payment URT INC3781884 0
374 AP General Inquiry NaN INC3775944 0

I use the following code to group the data by test - please see image below.

test_breakdown = relevant_URT_data[["test","MRN","Number"]] \
        .groupby("test") \
        .agg({'MRN':'count', 'Number':'size'}) \
        .rename(columns={'MRN':'URT Use Count','Number':'Number'})

test URT Use Count Number
AP Connexxus Access Request 9 9
AP Disbursements Payment Details 28 35
AP General Inquiry 1 7
AP General Submit Invoice for Payment 25 27
AP General WebNow/Invoice Copies 0 4
AP MyExpense Access Request 3 3
AP MyExpense Grant Delegated Access 0 1
AP MyExpense Inquiry 2 8
AP Quality Assurance Payment Status 56 71
Controller's Office General Medical Center 7 10
PR Accounting W-2 Form 6 9
PR HBS Inquiry 0 17
PR HBS Timesheet Calculations 0 2
PR Processing and Production Direct Deposit 2 2
PR Processing and Production Payment Details 0 1
PR Verification of Employment How to Request VOE 1 1
PR Verification of Employment Written VOE 2 4

My goal is to add 2 more columns to the grouped dataset:

1) A column that shows 'URT Use Count' divided by 'Number' (I want the percentage of Number that is MRN) 2) A column that shows ResCat == 0 divided by Number (I want the percentage of Number that has ResCat equal to 0)

Here is optimally what I would want the output to look like- the two new columns are displayed below:

test URT Use Count Number Percentage_Use Same_Day_Percentage
AP Connexxus Access Request 9 9 100% 55%
AP Disbursements Payment Details 28 35 80% 77%
AP General Inquiry 1 7 14% 92%
AP General Submit Invoice for Payment 25 27 92% 97%
AP General WebNow/Invoice Copies 0 4 0% 19%
AP MyExpense Access Request 3 3 100% 50%
AP MyExpense Grant Delegated Access 0 1 0% 50%
AP MyExpense Inquiry 2 8 25% 77%
AP Quality Assurance Payment Status 56 71 79% 88%
Controller's Office General Medical Center 7 10 70% 20%
PR Accounting W-2 Form 6 9 67% 20%
PR HBS Inquiry 0 17 0% 100%
PR HBS Timesheet Calculations 0 2 0% 45%
PR Processing and Production Direct Deposit 2 2 100% 99%
PR Processing and Production Payment Details 0 1 0% 15%
PR Verification of Employment How to Request VOE 1 1 100% 12%
PR Verification of Employment Written VOE 2 4 50% 22%

Any clarification on creating these calculated fields would be extremely helpful.

Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. — jezrael, Mar 30 '18 at 09:39
Super edit, thank you. Is possible add expected output from your input data, especially new columns what need append? — jezrael, Mar 30 '18 at 09:54
Thanks @MrT for the feedback/direction - I've cleaned it up. — Samuel Anspach, Apr 02 '18 at 00:16
Thanks @jezrael for the feedback/answer below - I've made the updates. — Samuel Anspach, Apr 02 '18 at 00:17

jezrael · Accepted Answer · 2018-03-30T09:30:28.047

IIUC need:

df = pd.DataFrame({'A':list('abcdef'),
                   'MRN':[4,5,4,5,5,np.nan],
                   'Number':[7,8,9,4,2,3],
                   'ResCat':[0,1,0,0,1,1],
                   'test':list('aaabbb')})

print (df)
   A  MRN  Number  ResCat test
0  a  4.0       7       0    a
1  b  5.0       8       1    a
2  c  4.0       9       0    a
3  d  5.0       4       0    b
4  e  5.0       2       1    b
5  f  NaN       3       1    b

df1 = df.groupby("test") \
        .agg({'MRN':'count', 'Number':'size', 'ResCat':lambda x: (x == 0).sum()}) \
        .rename(columns={'MRN':'URT Use Count'}) \
        .reset_index()

print (df1)
  test  URT Use Count  Number  ResCat
0    a              3       3       2
1    b              2       3       1

df1[['a','b']] = df1[['ResCat','URT Use Count']].div(df1['Number'], axis=0)
print (df1)
  test  URT Use Count  Number  ResCat         a         b
0    a              3       3       2  0.666667  1.000000
1    b              2       3       1  0.333333  0.666667

Thank you so much - the solution worked and I learned more about creating calculated fields in Pandas. — Samuel Anspach, Apr 02 '18 at 00:42

Assistance creating calculated columns with Pandas Groupby

1 Answers1