specific sumproduct list comprehension in pandas

Question

suppose i have a dataframe

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})
  age  income
0    0       5
1    5      13
2   10      23
3   15      18
4   20      12

i want to iterate through df["income"] and calculate the sumproduct as follows (example for age 15): 18+23*(15-10)+13*(15-5)+5*(15-0) = 338.

more generic: income[3] + income[2] * ( age[3] - age[2] ) + income[1] * ( age[3] - age[1] ) + income[0] * (age[3] - age[0] )

I am struggling to formulate the age relative to the current iteration of age ( age[x] - age[y] ) in a generic way to use in a list comprehension or formula.

edit: the actual operation I want to apply is

income[3 ] + income[2]* interest ** ( age[3] - age[2] ) + income[1]*interest ** (age[3] - age[1] ...

exampe from above: 18+23*1.03 ** (15-10)+13*1.03 ** (15-5)+5*1.03 **(15-0) = 69,92

interest = 1.03

ANSWERED thanks to jezrael & mozway

How looks expected final ouput? – jezrael Dec 02 '22 at 12:52 — jezrael, Dec 02 '22 at 12:52
What is number of rows in real data? – jezrael Dec 02 '22 at 14:08 — jezrael, Dec 02 '22 at 14:08

jezrael · Answer 1 · 2022-12-02T13:47:09.537

Numpy solution - you can use broadcasting for avoid loops for improve performance:

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})

interest = 1.03
age = df['age'].to_numpy()

Use power with subtracted values of mask:

arr = interest ** (age[:, None] - age ) 
print (arr)
[[1.         0.86260878 0.74409391 0.64186195 0.55367575]
 [1.15927407 1.         0.86260878 0.74409391 0.64186195]
 [1.34391638 1.15927407 1.         0.86260878 0.74409391]
 [1.55796742 1.34391638 1.15927407 1.         0.86260878]
 [1.80611123 1.55796742 1.34391638 1.15927407 1.        ]]

Then set 0 to upper triangle:

arr = np.where(np.triu(np.ones(arr.shape, dtype=bool)), 0, arr)
print (arr)
[[0.         0.         0.         0.         0.        ]
 [1.15927407 0.         0.         0.         0.        ]
 [1.34391638 1.15927407 0.         0.         0.        ]
 [1.55796742 1.34391638 1.15927407 0.         0.        ]
 [1.80611123 1.55796742 1.34391638 1.15927407 0.        ]]

Set 1 to diagonal:

np.fill_diagonal(arr, 1)
print (arr)
[[1.         0.         0.         0.         0.        ]
 [1.15927407 1.         0.         0.         0.        ]
 [1.34391638 1.15927407 1.         0.         0.        ]
 [1.55796742 1.34391638 1.15927407 1.         0.        ]
 [1.80611123 1.55796742 1.34391638 1.15927407 1.        ]]

Multiple by column income and sum per rows:

print (arr * df['income'].to_numpy())
[[ 5.          0.          0.          0.          0.        ]
 [ 5.79637037 13.          0.          0.          0.        ]
 [ 6.7195819  15.07056297 23.          0.          0.        ]
 [ 7.78983708 17.47091293 26.66330371 18.          0.        ]
 [ 9.03055617 20.25357642 30.91007672 20.86693334 12.        ]]


df['new'] = (arr * df['income'].to_numpy()).sum(axis=1)
print (df)   age  income        new
0    0       5   5.000000
1    5      13  18.796370
2   10      23  44.790145
3   15      18  69.924054
4   20      12  93.061143

Performance: For 5k rows, applyare loops under the hood, so slow (best avoid it)

df = pd.DataFrame({"age" : [0, 5, 10, 15, 20], "income": [5, 13, 23, 18, 12]})
df = pd.concat([df] * 1000, ignore_index=True)


In [292]: %%timeit
     ...: age = df['age'].to_numpy()
     ...: 
     ...: arr = interest ** (age[:, None] - age ) 
     ...: arr = np.where(np.triu(np.ones(arr.shape, dtype=bool)), 0, arr)
     ...: np.fill_diagonal(arr, 1)
     ...: df['new'] = (arr * df['income'].to_numpy()).sum(axis=1)
     ...: 
1.39 s ± 69.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [293]: %%timeit
     ...: df['sumproduct'] = (df['age'].expanding().apply(lambda x: sum(df.loc[:x.index[-1], 'income'] * interest**(x.iloc[-1]-x))))
     ...: 
5.13 s ± 411 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@krawall - Performance is not important? `df['sumproduct'] = (df['age'].expanding() .apply(lambda x: sum(df.loc[:x.index[-1], 'income'] * interest**(x.iloc[-1]-x))))` is slow like solution above. — jezrael, Dec 02 '22 at 13:46
It depends how you define performance, the memory consumption explodes with numpy brodcasting (quadratic space complexity), `expanding` runs fine with 100k rows, numpy broadcasting just crashes… — mozway, Dec 02 '22 at 14:02
i am in no position to judge performance really, i gave mozway the right answer because his answer was boiled down in one line of code, which to me as noob is easier to comprehend. also I have 40 rows, so it's probably neglible. i didn't wanna discredit your answer! — krawall, Dec 02 '22 at 14:08
@krawall - Ok. Reason for added intermediate ouputs was for explain how it working, because understand, it is more complicated. And why I dont like apply - [check this](https://stackoverflow.com/questions/54432583/when-should-i-not-want-to-use-pandas-apply-in-my-code/54432584#54432584) and avoid if possible. — jezrael, Dec 02 '22 at 14:14
I'm gonna dive into your answer, too. but right now I probably don't know enough about python to truly appreciate it :) — krawall, Dec 02 '22 at 14:16
@krawall - sure, mozway first answer is super, second is looping, but agree, it is less typing, more clear. It is up to you, what solution use (especially if small data) Happy coding! — jezrael, Dec 02 '22 at 14:19

mozway · Accepted Answer · 2022-12-02T13:39:27.117

You can rewrite 18+(18+23+13+5)*15-(23+13+5)*(10+5+0) to be 18+(18+23+13+5)*15-(23+13+5)*(10+5+0):

The general formula is thus:

sumproduct(n) = (income
                + (n-1)*sum(age[:n-1]*income[:n-1])
                - sum(age[:n-1]*income[:n-1])
               )

As code:

df['sumproduct'] = (df['income']
 .add(df['age'].mul(df['income'].cumsum().shift(fill_value=0)))
 .sub(df['age'].mul(df['income']).cumsum().shift(fill_value=0))
)

output:

   age  income  sumproduct
0    0       5           5
1    5      13          38
2   10      23         138
3   15      18         338
4   20      12         627

power

powers are more complex as you cannot directly factorize, you can however rewrite the operation with expanding:

df['sumproduct'] = (df['age'].expanding()
 .apply(lambda x: sum(df.loc[:x.index[-1], 'income'] * interest**(x.iloc[-1]-x)))
)

Output:

   age  income  sumproduct
0    0       5    5.000000
1    5      13   18.796370
2   10      23   44.790145
3   15      18   69.924054
4   20      12   93.061143

this looks very good. I have one more question: my actual operation is income * interest ** age. I figure for the power i use .pow instead of .mul . where do i add the multiplication with the interest. >test['sumproduct'] = (test['income'] .add(test['age'].pow(test['income'].mul(interest).cumsum().shift(fill_value=0))) .sub(test['age'].pow(test['income'].mul(interest)).cumsum().shift(fill_value=0)) ) — krawall, Dec 02 '22 at 12:45
Can you add this input/output example in the question for clarity? — mozway, Dec 02 '22 at 12:46

specific sumproduct list comprehension in pandas

2 Answers2

power