Pandas complex math in groupby+aggregation

Question

I want to run some complex math while aggregating. I wrote the aggregation function:

import math as mt
# convert calc cols to float from object
cols = dfg_dom_smry.columns
cols = cols[2:]
for col in cols:
    df[col] = df[col].astype(float)

# groupby fields
index = ['PA']

#aggregate
df = dfdom.groupby(index).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')**2))/sum('savings')})

I got an error: TypeError: can't multiply sequence by non-int of type 'str'

This is an extract of my data. The full data has many set of savings and rp columns. So ideally I want to run a for loop for each set of savings and rp columns

PA   domain          savings           rp 
 M   M-RET-COM       383,895.36      0.14 
 P   P-RET-AG        14,302,804.19   0.16 
 P   P-RET-COM       56,074,119.28   0.33 
 P   P-RET-IND       46,677,610.00   0.27 
 P   P-SBD/NC-AG     1,411,905.00    -   
 P   P-SBD/NC-COM    4,255,891.25    0.36 
 P   P-SBD/NC-IND    295,365.00      -   
 S   S-RET-AG        2,391,504.33    0.72 
 S   S-RET-COM       19,195,073.84   0.18 
 S   S-RET-IND       17,677,708.38   0.13 
 S   S-SBD/NC-COM    6,116,407.07    0.05 
 D   D-RET-COM       11,944,490.39   0.15 
 D   D-RET-IND       1,213,117.63    -   
 D   D-SBD/NC-COM    2,708,153.57    0.69 
 C   C-RET-AG       
 C   C-RET-COM      
 C   C-RET-IND

For the above data this would be the final result:

 PA  newcol1
 M   0.143027374757981 
 P   0.18601700701305
 S   0.0979541706738756
 D   0.166192684106493
 C

thanks for your help

Scott, that's what this does: `for col in cols: df[col] = df[col].astype(float)`. I confirmed after that savings is in float64 format. Plus I commented out the float conversion for the solution I posted below and it worked fine. Any other suggestions? — SModi, Nov 17 '20 at 21:04
Putting aside the operability of the expression: What's the point with `mt.sqrt(sum('savings'*'rp')**2)` ? At first, I thought that what you wanted was the sum of the squares. But it looks like you want the square of the sum.. And then the square root... Which makes little sense without, say, negative values... I think that what you want is the sum of the square values... — keepAlive, Nov 17 '20 at 21:50
that's correct, it's done to work with absolute numbers. This is a rollup of relative precision. You want to include negative savings in the calculations and so squaring first takes care of the negativeness. — SModi, Nov 17 '20 at 21:53
And why not use `abs` then ? Which would promote readability I think. — keepAlive, Nov 17 '20 at 21:59
good question. I think it maybe something else because (a2 + b2 + c2)^1/2 is not equal to a+b+c even if there were no negative numbers. I'm not a statistician. Will go back to my colleague who is, and find out. — SModi, Nov 17 '20 at 22:24
@keepAlive I have an answer from our statistician in case you're interested. Posted over multiple comments as too long: "The method you are referring to is called propagation in quadrature, or summation in quadrature, and the point of it is not to deal with negative values. The aim of the method is to produce an uncertainty or error value for a single estimate that is composed of multiple estimates from different sources. A reference for it is in "An Introduction to Error Analysis" by John R. Taylor. idk the current edition. — SModi, Nov 18 '20 at 09:09
That's for the basic method of accounting for multiple measurements, each with its own uncertainty or error. Then Roger Wright and Jacobs (I think) expanded it for our industry by incorporating savings with the related uncertainties. Relative precision is one type of error or uncertainty, and they use that in combination with associated savings levels to roll up results of multiple programs or PAs, or whatever unit of analysis is relevant so that the error will be minimized by accumulating savings from all the programs, using them, essentially as weights. — SModi, Nov 18 '20 at 09:09
This summation in quadrature is core to estimating relative precision for a single estimate, when summing the results of multiple entities into one estimate and its associated standard error. Here is the reference for the specific way we have used the summation in quadrature method: A Methodology For Integration of Evaluation Studies” (Wright and Jacobson 1993). — SModi, Nov 18 '20 at 09:09

SModi · Answer 1 · 2020-11-17T22:35:46.040

update: I replaced the code below with that in the accepted answer.

This is what I finally did. I created a function to step through each of the calculations and call the function for each set of savings and rp cols.

# the function
def rollup(df, index, svgs, rp):
    df['svgs_rp'] = (df[svgs]*df[rp])**2
    df2 = df.groupby(index).agg({'svgs_rp':'sum',
                                 svgs:'sum'})
    df2['temp'] = np.where((df2[svgs] == 0), '', ((df2['svgs_rp']**(1/2))/df2[svgs]))
    df2 = df2['temp']
    return df2

#calling the function
index = ['PA']

# the next part is within a for loop to go through all the savings and rp column sets. For simplicity I've removed the loop.
svgs = 'savings1' 
rp = 'rp1'
dftemp = rollup(dfdom, index, svgs, rp)
dftemp.rename({'temp': newcol1}, axis=1, inplace=True)
df = pd.merge(df, dftemp, on=index, how = 'left') # this last step was not put in my original question. I've provided so the r-code below makes sense.

annoying that I have to first do the math in new columns and then aggregate. This is the equivalent code in R:

# the function
roll_up <- function(savings,rp){
   return(sqrt(sum((savings*rp)^2,na.rm=T))/sum(savings,na.rm=T))

# calling the function
df <- df[dfdom[,.(newcol1=roll_up(savings1,rp1),
                  newcol2=roll_up(savings2,rp2),...
                  newcol_n=roll_up(savings_n,rp_n)),by='PA'],on='PA']

I'm relatively new to python programming, and this the best I could come up with. If there is a better way to do this, please share. Thanks.

keepAlive · Accepted Answer · 2020-11-19T15:52:51.117

1

What about

o = dfdom.groupby(index).apply(
    lambda s: pow(pow(s.savings*s.rp, 2).sum(), .5)/(s.savings.sum() or 1)
)

?

Where s above stands for pandas.Series.

Also, note that o is an instance of pandas.Series, which means that you will have to convert it into a pandas.DataFrame, at least to justify the name you give it, i.e. df. You can do so by doing:

df = o.to_frame('the column name you want')

Put differently/parametrically

def rollup(df, index, svgs, rp, col_name):
    return df.groupby(index).apply(
        lambda s: pow(pow(s[svgs]*s[rp], 2).sum(), .5)/(s[svgs].sum() or 1)
    ).to_frame(col_name)

# rollup(dfdom, index, 'savings', 'rp', 'svgs_rp')

edited Nov 19 '20 at 15:52

answered Nov 17 '20 at 21:14

keepAlive

6,369
5
24
39

thanks I've added the expected result in my question. Will your solution work when the denominator `s.savings.sum()` is 0? – SModi Nov 17 '20 at 21:36
1

@Smodi Yup, see the last edit, i.e. the `(s[svgs].sum() or 1)` chunk. Indeed, `0 or 1` returns 1. Such *expression* is called [Short-Circuit Evaluation](https://stackoverflow.com/questions/2580136/does-python-support-short-circuiting). – keepAlive Nov 17 '20 at 21:38
brilliant. just incorporated into my code and gave the same results as my first code. I also wasn't aware of pow. More neat. Thanks. – SModi Nov 17 '20 at 22:01
how would this code work if I don't have want to groupby index? I tried ```return df.apply(lambda s: pow(pow((s[svgs]*s[rp]).sum(), 2), .5)/(s[svgs].sum() or 1)).to_frame(col_name)``` I think the issue is the output is no longer a series. – SModi Nov 19 '20 at 13:34
FYI I was able to do it this way: ```return pd.DataFrame([pow(pow(dfg_pa_smry[svgs]*dfg_pa_smry[rp],2).sum(),.5)/dfg_pa_smry[svgs].sum()], columns=[colname])```, but would like to see if it can also be done with the above approach. – SModi Nov 19 '20 at 13:51
sorry am going back and forth on this. tired brains. You do need the sum. The final output is a single value. but the input data is the output from the first dataset - i.e. multiple lines. – SModi Nov 19 '20 at 13:56
No problem, you are welcome @Smodi. What does `type(dfg_pa_smry)` return ? I notice that what you want in this case is what I was telling you yesterday, the square root of the sum. The point is that `pow(pow(dfg_pa_smry[svgs]*dfg_pa_smry[rp],2).sum(),.5)/dfg_pa_smry[svgs].sum()` returns a simple value. Why put it in a dataframe ? – keepAlive Nov 19 '20 at 14:05
rewrote code to keep consistent with format used here: ```return pd.DataFrame([pow(pow(df[svgs]*df[rp],2).sum(),.5)/df[svgs].sum()], columns=[colname])``` Your original/this code is called from a for loop for a set of svgs and rp columns. Each output is then added to a master database that already holds other results for that level of aggregation. Putting it in a dataframe here makes the function output consistent and I don't have to write different code for when the output is a simple value. – SModi Nov 19 '20 at 15:05
@Smodi ok! Good practice. What about `pd.DataFrame({colname: {'PA':(pow(pow(dfg_pa_smry[svgs]*dfg_pa_smry[rp],2).sum(),.5)/dfg_pa_smry[svgs].sum())}})` ? Where `'PA'` actually stands for the name of the row. – keepAlive Nov 19 '20 at 15:23
please can you correct your answer. I think your originally had it right. ```lambda s: pow(pow((s[svgs]*s[rp]).sum(), 2), .5)/(s[svgs].sum() or 1)``` should be ```lambda s: pow(pow((s[svgs]*s[rp]), 2).sum(), .5)/(s[svgs].sum() or 1)``` (the first sum comes after pow 2. Also make this the final answer and hash out the abs version given it does not do what I want as discussed in comments to my question. – SModi Nov 19 '20 at 15:47
@SModi I have rolled back my answer accordingly. – keepAlive Nov 19 '20 at 15:53

score 0 · Answer 3 · answered Nov 17 '20 at 15:17

0

Your groupby should have () and then the [] like this:

df = dfdom.groupby([index]).agg({'newcol1': (mt.sqrt(sum('savings'*'rp')^2))/sum('savings')})

answered Nov 17 '20 at 15:17

Arvind Kumar

451
2
10

Thanks I fixed that. Have also updated my question for this. Still get the following error: TypeError: can't multiply sequence by non-int of type 'str' – SModi Nov 17 '20 at 15:27
sorry I changed in my code but not here. Now changed. The updated error is still correct. It doesn't work – SModi Nov 17 '20 at 19:14
Also I don't think the [] is required inside the (). because index is created as a list – SModi Nov 17 '20 at 21:05
yeah [] are required when you are grouping using more than one column. – Arvind Kumar Nov 18 '20 at 06:11

Pandas complex math in groupby+aggregation

3 Answers3

update: I replaced the code below with that in the accepted answer.