TL;DR: go to the benchmark code and use Method 1.
Short Answer
No. Vectorization is not possible.
Long Answer
Theorem: For this particular task, the output of a given row cannot be determined using any finite length of backward rolling window smaller than the partial length up to this row.
Thus, there is no way for this output logic to be processed in a vectorized way. (See this answer for an idea of vectorization is performed in CPUs). The output can only be computed from the beginning of the dataframe.
Proof: Consider a target row of a dataframe df
. Assume there is a backward rolling window with size n < partial length
, so a previous value of df["E"]
exists before the window. We denote this previous value by state
.
Consider a special case: df["C"] == -1
and df["D"] == 1
within the window.
- Case 1 (
state < 0
): The output within this rolling window will be [1, -1, 1, -1, .....], making the last element (-1)^(n-1)
- Case 2 (
state >= 0
): The output will be [-1, 1, -1, 1, .....], making the last element (-1)^(n)
Therefore, it is possible for the output df["E"]
of the target row to be dependent on a state variable outside the window. QED.
Useful Answer
Although vectorization is impossible, it does not mean that significant acceleration cannot be achieved. A simple yet very efficient approach is using a numba-compiled generator
to perform the sequential generation. It only requires re-writing your logic into a generator function and add two additional lines:
import numba
@numba.njit
def my_generator_func():
....
Of course, you may have to install numba first. If this is not possible, then using a plain generator without numba optimization is also fine.
Benchmark
The benchmark is performed on a i5-8250U (4C8T) laptop with 16GB RAM running 64-bit debian 10. Python version is 3.7.9 and pandas is 1.1.3. n = 10^7
(10 million) records are generated for benchmarking purposes.
Result:
1. numba-njit: 2.48s
2. plain generator (no numba): 5.13s
3. original: 271.15s
> 100x
efficiency gain can be achieved against the original code.
Code
from datetime import datetime
import pandas as pd
import numpy as np
n = 10000000 # a large number of rows
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
#print(df.head())
# ========== Method 1. generator + numba njit ==========
ti = datetime.now()
import numba
@numba.njit
def gen(plus: np.array, minus: np.array):
l = len(plus)
assert len(minus) == l
# first
state = minus[0]
yield state
# second to last
for i in range(l-1):
state = minus[i+1] if state < 0 else plus[i+1]
yield state
df["E"] = [i for i in gen(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"1. numba-njit: {(tf-ti).total_seconds():.2f}s") # 1. numba-njit: 0.47s
# ========== Method 2. Generator without numba ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()
def gen_plain(plus: np.array, minus: np.array):
l = len(plus)
assert len(minus) == l
# first
state = minus[0]
yield state
# second to last
for i in range(l-1):
state = minus[i+1] if state < 0 else plus[i+1]
yield state
df["E"] = [i for i in gen_plain(df["C"].values, df["D"].values)]
tf = datetime.now()
print(f"2. plain generator (no numba): {(tf-ti).total_seconds():.2f}s") #
# ========== Method 3. Direct iteration ==========
df = pd.DataFrame({"C": -np.ones(n), "D": np.ones(n)})
ti = datetime.now()
# code provided by the OP
df['E']=np.nan
df.at[0,'E'] = df['D'][0]
l=len(df)
for i in range(l - 1):
if df['E'][i] < 0:
df.at[i+1,'E'] = df['D'][i+1]
else:
df.at[i+1,'E'] = df['C'][i+1]
tf = datetime.now()
print(f"3. original: {(tf-ti).total_seconds():.2f}s") # 2. 26.61s