A recursive function is difficult to vectorize because each value at time t depends on the previous value at time t-1.

[Question updated below with slightly more complex example x_t = a x_{t-1} + b.]

Issue with .loc returning different data types

import pandas
df1 = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7)})
# Set the initial value
t0 = min(df1.year)
df1.loc[df1.year==t0, "x"] = 0

This assignment doesn't work when the right-hand side is a pandas.core.series.Series: every row after the first stays NaN.

for t in range (min(df1.year)+1, max(df1.year)+1):
    df1.loc[df1.year==t, "x"] = df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]
print(df1)
#    year  a    x
# 0  2020  3  0.0
# 1  2021  4  NaN
# 2  2022  5  NaN
# 3  2023  6  NaN
print(type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]))
# <class 'pandas.core.series.Series'>

The assignment works when the right-hand side is a numpy array:

for t in range (min(df1.year)+1, max(df1.year)+1):
    df1.loc[df1.year==t, "x"] = (df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()
    #break
print(df1)
#    year  a     x
# 0  2020  3   0.0
# 1  2021  4   3.0
# 2  2022  5   7.0
# 3  2023  6  12.0
print(type((df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]).unique()))
# <class 'numpy.ndarray'>

The assignment works directly when .loc selects on a year index:

df2 = df1.set_index("year").copy()
# Set the initial value
df2.loc[df2.index.min(), "x"] = 0
for t in range (df2.index.min()+1, df2.index.max()+1):
    df2.loc[t, "x"] = df2.loc[t-1, "x"] + df2.loc[t-1,"a"]
    #break
print(df2)
#       a     x
# year
# 2020  3   0.0
# 2021  4   3.0
# 2022  5   7.0
# 2023  6  12.0
print(type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"]))
# <class 'numpy.float64'>
  • type(df1.loc[df1.year==t-1,"x"] + df1.loc[df1.year==t-1,"a"]) is a pandas Series, while type(df2.loc[t-1, "x"] + df2.loc[t-1,"a"]) is a numpy float. Why are these types different?
  • If I do not want to use set_index() before the computation, is there a better way to write a recursive .loc assignment than to use .unique()?
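
A possible way to look at both points: a boolean mask can match any number of rows, so .loc returns a Series for it, whereas a scalar label on a unique index returns a scalar. When a Series is assigned through .loc, pandas aligns it on index labels, so a value labelled with row t-1 never lands in row t. The sketch below assumes exactly one row per year and strips the index with .to_numpy() instead of .unique(); the name prev is only a local helper.

```python
import pandas

df1 = pandas.DataFrame({"year": range(2020, 2024), "a": range(3, 7)})
# Set the initial value
df1.loc[df1.year == df1.year.min(), "x"] = 0.0

for t in range(df1.year.min() + 1, df1.year.max() + 1):
    # Previous row, still carrying the index label of row t-1
    prev = df1.loc[df1.year == t - 1, ["x", "a"]]
    # .to_numpy() drops that label, so no index alignment happens on assignment
    df1.loc[df1.year == t, "x"] = (prev["x"] + prev["a"]).to_numpy()

print(df1["x"].tolist())  # [0.0, 3.0, 7.0, 12.0]
```

Unlike .unique(), .to_numpy() does not silently collapse duplicate values, so an unexpected second row for the same year is less likely to go unnoticed.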

Example using multiplicative and additive component

Our real problem is more complicated since there is a multiplicative and an additive component

import pandas
df3 = pandas.DataFrame({'year':range(2020,2024),'a':range(3,7), 'b':range(8,12)})
df3 = df3.set_index("year").copy()
# Set the initial value
df3.loc[df3.index.min(), "x"] = 0
for t in range (df3.index.min()+1, df3.index.max()+1):
    df3.loc[t, "x"] = df3.loc[t-1, "x"] * df3.loc[t-1, "a"] + df3.loc[t-1, "b"]
    #break
print(df3)
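
For comparison, here is a sketch of the same recursion run on plain NumPy arrays, which avoids one .loc lookup per step. It assumes the rows are sorted by year with no gaps; a, b and x are only local helper names.

```python
import numpy
import pandas

df3 = pandas.DataFrame(
    {"year": range(2020, 2024), "a": range(3, 7), "b": range(8, 12)}
).set_index("year")

a = df3["a"].to_numpy()
b = df3["b"].to_numpy()
x = numpy.zeros(len(df3))  # x[0] = 0 is the initial value
for i in range(1, len(df3)):
    # Same rule as the .loc loop: x_t = x_{t-1} * a_{t-1} + b_{t-1}
    x[i] = x[i - 1] * a[i - 1] + b[i - 1]
df3["x"] = x

print(df3["x"].tolist())  # [0.0, 8.0, 41.0, 215.0]
```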
Paul Rougieux
  • So there are recursive operations which _must_ use a loop (e.g. resetting a cumulative sum once a threshold is met) and then there are other recursive operations which _can_ be rewritten as a vectorized operation, typically using some `shift` or `expanding` calculation. In your example the recursion is a simple shifted `cumsum`: `df1['x'] = df1['a'].shift().cumsum().fillna(0)`, but it's unclear if this is just a simplified example for the sake of an MCVE. – ALollz Nov 16 '21 at 17:48
  • Thank you for putting me on the right path. I oversimplified the example in my question. In our real problem, there is a multiplicative and an additive component: x_t = a x_{t-1} + b. I should be able to replace the loop by splitting the computation between `cumprod()` and `cumsum()`, although that might make the code more obscure for my colleague. – Paul Rougieux Nov 17 '21 at 08:45
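
One way the cumprod()/cumsum() split could look, as a sketch rather than a definitive implementation: dividing the recurrence x_t = a_{t-1} x_{t-1} + b_{t-1} by the running product P_t = a_0 ... a_{t-1} turns it into a plain cumulative sum. This assumes x_0 = 0, rows sorted by year, and that no value of a is zero; p and y are hypothetical helper names.

```python
import pandas

df3 = pandas.DataFrame(
    {"year": range(2020, 2024), "a": range(3, 7), "b": range(8, 12)}
).set_index("year")

# P_t = a_0 * a_1 * ... * a_{t-1}, with P_0 = 1 (empty product)
p = df3["a"].shift(fill_value=1).cumprod()
# With y_t = x_t / P_t the recurrence becomes y_t = y_{t-1} + b_{t-1} / P_t
y = (df3["b"].shift() / p).fillna(0).cumsum()
# Undo the scaling to recover x_t
df3["x"] = p * y

print(df3["x"].round(6).tolist())
```

For this data the result agrees with the loop (0, 8, 41, 215) up to floating-point rounding. For long series the running product can overflow or underflow, in which case the explicit loop may be the safer choice.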

1 Answer

Sorry if I have misunderstood; is this what you want?

df1['x']= df1['a'].cumsum().shift().fillna(0)
print(df1)

output:

   year  a     x
0  2020  3   0.0
1  2021  4   3.0
2  2022  5   7.0
3  2023  6  12.0
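
As a quick sanity check, the sketch below compares this one-liner with an explicit loop for the purely additive case x_t = x_{t-1} + a_{t-1}, x_0 = 0. The list name expected is only a local helper.

```python
import pandas

df1 = pandas.DataFrame({"year": range(2020, 2024), "a": range(3, 7)})

# Reference values from an explicit loop over the rows
expected = [0.0]
for a_prev in df1["a"].iloc[:-1]:
    expected.append(expected[-1] + float(a_prev))

# Vectorized version: cumulative sum of a, shifted down by one row
df1["x"] = df1["a"].cumsum().shift().fillna(0)

print(df1["x"].tolist() == expected)  # True
```

Shifting first and summing afterwards, `df1['a'].shift().cumsum().fillna(0)`, gives the same column for this data.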
Wilian
  • The real problem has a multiplicative and an additive component x_t = a x_{t-1} + b. I updated the question to reflect this. – Paul Rougieux Nov 17 '21 at 09:20