
I have a dataframe where the first row is the initial condition.

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan]* 3})

and a function f(x,r) = r*x*(1-x), where r = 2 is a constant and 0 <= x <= 1.

I want to produce the following dataframe by applying the function to column Pop row-by-row iteratively. I.e., df.Pop[i] = f(df.Pop[i-1], r=2)

df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4, 0.48, 0.4992, 0.49999872]})

Question: Is it possible to do this in a vectorized way?

I can achieve the desired result by using a loop to build lists for the x and y values, but this is not vectorized.

I have also tried the following (with R = 2), but every NaN entry ends up as 0.48.

df.loc[1:, "Pop"] = R * df.Pop[:-1] * (1 - df.Pop[:-1])
Bill Huang
Matt
  • Can you give an example with data explaining the desired behavior? – Siva Kumar Sunku Oct 24 '20 at 16:26
  • The function is f(x) = r * x * (1 - x), where r is a constant and x is a percentage. The first column is the starting conditions, where r = 2,x = 0.4, the index is the time interval. d[0] would be (0, 0.4). d[1] is (1, 0.48). d[2] should be (2, 0.4992), but it's (2, 0.48), the same as the remaining rows. – Matt Oct 24 '20 at 18:29
  • None of your variables are defined in your code. Please provide a [reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) with input data and expected output. – Michael Szczesny Oct 24 '20 at 20:35
  • Could you provide sample data in a [reproducible way](https://stackoverflow.com/questions/20109391) and the expected output? – Bill Huang Oct 24 '20 at 22:52
  • Your edit broke the example that was correct. `np.arange(51)` raises an error because the lengths of the lists do not match. Please note that **the primary purpose of an example is for the potential helpers to reproduce your problem**, so focus on the relevant part only and simplify the rest as much as possible. I am editing your post now. Please wait for the edit to take effect. – Bill Huang Oct 25 '20 at 12:25

1 Answer

It is IMPOSSIBLE to do this in a vectorized way.

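Vectorization applies one operation to a whole array at once, which only works when each output element can be computed independently of the others. Here every value depends on the result computed just before it, so the values must be produced in sequential order, not all at once. See this answer for a detailed explanation. Things like df.expanding(2).apply(f) and df.rolling(2).apply(f) won't work either, because they only see the original column values (mostly NaN), not the results computed so far.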
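To make the dependency concrete, here is a minimal sketch of what a single vectorized step can and cannot do. It rebuilds the example dataframe from the question; the names `prev` and `one_step` are just illustrative. Only row 1 gets a value, because the rows after it would need results that do not exist yet.

```python
import numpy as np
import pandas as pd

r = 2
df = pd.DataFrame({"Year": np.arange(4),
                   "Pop": [0.4] + [np.nan] * 3})

# One vectorized step: every row is computed from the value currently
# stored in the row above it.
prev = df["Pop"].shift(1)
one_step = r * prev * (1 - prev)

# Row 0 has no predecessor, row 1 is computed from 0.4, and rows 2 and 3
# stay NaN because their predecessors are still NaN at this point.
print(one_step)
```

Repeating this step once per row would eventually fill the column, but that is just a loop again, with more work per iteration.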

However, gaining more efficiency is possible. You can do the iteration using a generator. This is a very common construct for implementing iterative processes.

def gen(x_init, n, R=2):
    x = x_init
    for _ in range(n):
        x = R * x * (1-x)
        yield x

# execute: fill rows 1..n-1 starting from the initial condition in row 0
df.loc[1:, "Pop"] = list(gen(df.at[0, "Pop"], len(df) - 1))

Result:

print(df)
   Year       Pop
0     0  0.400000
1     1  0.480000
2     2  0.499200
3     3  0.499999

It is completely OK to stop here for small data. If the function is going to be run many times or over many rows, however, you can consider optimizing the generator with numba.

  • Run pip install numba or conda install numba in the console first.
  • Add import numba to your script.
  • Add the decorator @numba.njit directly above the generator definition.
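The steps above can be sketched as follows. Note that this is a variant and not the decorated generator itself: it compiles a plain function that fills a preallocated NumPy array, which is an assumption on my part about what is convenient here. The name `logistic_series` is hypothetical, and the `try`/`except` fallback is only there so the sketch still runs (uncompiled) when numba is not installed.

```python
import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:
    # Fallback (assumption): without numba, run the same code as plain Python.
    def njit(func):
        return func


@njit
def logistic_series(x_init, n, R):
    """Return the n values that follow x_init under x -> R * x * (1 - x)."""
    out = np.empty(n)
    x = x_init
    for i in range(n):
        x = R * x * (1 - x)  # each step needs the previous result
        out[i] = x
    return out


values = logistic_series(0.4, 3, 2.0)
print(values)
```

With the dataframe from the question, the result could be assigned with df.loc[1:, "Pop"] = logistic_series(df.at[0, "Pop"], len(df) - 1, 2.0). The first call includes compilation time, so time a second call when comparing.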

Change the number of np.nan entries to 10^6 and check the difference in execution time yourself. On my Core i5-8250U 64-bit laptop the run time dropped from 468 ms to 217 ms.

Bill Huang