How to optimize a for loop in python that references values in a data frame that are updated in the loop

Question

The code below first updates the signal column with a 1 or -1 if certain conditions are met, otherwise the signal column is set to 0.

In the for-loop, the signal column gets updated to the previous signal value if certain conditions are met. I would like to replace the for-loop with a faster solution that is still able to update the signal column with the previous signal value considering the previous signal value could be updated through the process.

df['signal'] = np.where(np.logical_and(df['ind_1'] == 1, df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where(np.logical_and(df['ind_1'] == -1, df['ind_2'] >= 100 - threshold), -1, df['signal'])
for i in range(1, len(df)):
    if df.at[i, 'signal'] == 0 and df.at[i - 1, 'signal'] != 0:
        if df.at[i, 'ind_1'] == 0:
            df.at[i, 'signal'] = df.at[i - 1, 'signal']

Might be useful: [cs95's answer on "How to iterate over rows in a DataFrame in Pandas"; tldr DON'T](/a/55557758/4518341) — wjandrea, Jul 19 '23 at 01:10
It might help to make a [mre]. For specifics see [How to make good reproducible pandas examples](/q/20109391/4518341). — wjandrea, Jul 19 '23 at 01:22
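
A minimal reproducible example along the lines suggested above might look like the sketch below. The seven rows and `threshold = 40` are made-up values chosen for illustration; they are not from the question. The loop is the one from the question, unchanged.

```python
import numpy as np
import pandas as pd

threshold = 40  # assumed value, for illustration only
df = pd.DataFrame({
    'ind_1': [1, 0, 0, -1, 0, 1, 0],       # made-up data
    'ind_2': [30, 50, 50, 90, 10, 80, 20],
})

# Initial signal, as in the question
df['signal'] = np.where((df['ind_1'] == 1) & (df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where((df['ind_1'] == -1) & (df['ind_2'] >= 100 - threshold), -1, df['signal'])
initial = df['signal'].tolist()

# Loop from the question: carry the previous (possibly updated) signal forward
for i in range(1, len(df)):
    if df.at[i, 'signal'] == 0 and df.at[i - 1, 'signal'] != 0:
        if df.at[i, 'ind_1'] == 0:
            df.at[i, 'signal'] = df.at[i - 1, 'signal']

print(initial)                # [1, 0, 0, -1, 0, 0, 0]
print(df['signal'].tolist())  # [1, 1, 1, -1, -1, 0, 0]
```

Rows 1 and 2 both become 1, which shows the chaining: row 2 only qualifies after row 1 has been updated.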

blackraven · Accepted Answer · 2023-07-19T15:00:18.533

Vectorized operations are usually the fastest way to manipulate data in a DataFrame.
The code below demonstrates this on 1,000,000 records:

import pandas as pd
import numpy as np
from time import time

df = pd.DataFrame({
    'ind_1': np.random.randint(-1, 2, size=(1000000, )),
    'ind_2': np.random.randint(0, 101, size=(1000000, ))
})
threshold = 40
df['signal'] = np.where(np.logical_and(df['ind_1'] == 1, df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where(np.logical_and(df['ind_1'] == -1, df['ind_2'] >= 100 - threshold), -1, df['signal'])

df2 = df.copy()    #duplicate dataframe for comparison

I'm using df to time the original solution and the copy, df2, to time my proposed solution.

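In my proposed solution, df2['signal_shift1'] = df2['signal'].shift(1) moves the signal column down by one row, so that each row's previous signal can be compared on the same row. Because a row can only qualify after the row above it has been updated, this step is repeated in a while-loop until the signal column stops changing.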
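
As a small illustration of what `shift(1)` does, here is a sketch on a made-up four-element Series (not data from the answer). It also shows the `fill_value` parameter of `Series.shift`, which is one way to avoid the NaN in the first row.

```python
import pandas as pd

signal = pd.Series([-1, 0, 0, 1])  # made-up values

# Plain shift: row 0 has no previous row, so it becomes NaN and the
# column is upcast to float64
shifted = signal.shift(1)

# With fill_value the gap is filled and the integer dtype is kept
shifted_int = signal.shift(1, fill_value=0)

print(shifted.tolist())      # [nan, -1.0, 0.0, 0.0]
print(shifted_int.tolist())  # [0, -1, 0, 0]
```

The float upcast is why the answer's code patches row 0 by hand and finishes with `.astype('int')`.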

Your original conditions are then:
A: df.at[i, 'signal'] == 0 and
B: df.at[i - 1, 'signal'] != 0 and
C: df.at[i, 'ind_1'] == 0

With the shifted column they can all be evaluated on the same row:
A&C: df2['signal'].abs() + df2['ind_1'].abs() == 0 and
B: df2['signal_shift1'] != 0
Note that I've combined conditions A and C (signal == 0 and ind_1 == 0) into abs(signal) + abs(ind_1) == 0, which holds only when both values are zero.
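
The combined condition can be checked against the explicit form with a short sketch. The two Series below are made-up values, and the names `a` and `c` simply mirror conditions A and C above.

```python
import pandas as pd

a = pd.Series([0, 0, 1, -1, 1])    # stands in for 'signal'
c = pd.Series([0, 1, 0, -1, -1])   # stands in for 'ind_1'

combined = (a.abs() + c.abs()) == 0   # form used in the answer
explicit = (a == 0) & (c == 0)        # the two conditions written out
naive = (a + c) == 0                  # without abs(): 1 + (-1) also gives 0

print(combined.tolist())  # [True, False, False, False, False]
print(naive.tolist())     # [True, False, False, False, True]
```

The `abs()` calls are what make the sum safe; the explicit `&` form gives the same mask and may be easier to read.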

#original solution (using df)
ti = time()
for i in range(1, len(df)):
    if df.at[i, 'signal'] == 0 and df.at[i - 1, 'signal'] != 0:
        if df.at[i, 'ind_1'] == 0:
            df.at[i, 'signal'] = df.at[i - 1, 'signal']
print('Time taken original solution: {:.3f} sec'.format(time() - ti))

#proposed solution (using df2)
ti = time()
df2['signal_compare'] = 0    #initialize break condition
while not df2['signal_compare'].equals(df2['signal']):
    df2['signal_compare'] = df2['signal'].copy()    #condition to break while-loop
    df2['signal_shift1'] = df2['signal'].shift(1)
    df2.at[0, 'signal_shift1'] = df2.at[0, 'signal']    #to remove null value after .shift(1)
    df2['signal'] = np.where(np.logical_and(df2['signal'].abs() + df2['ind_1'].abs() == 0, df2['signal_shift1'] != 0), df2['signal_shift1'], df2['signal']).astype('int')
print('Time taken proposed solution: {:.3f} sec'.format(time() - ti))

#check if original solution and proposed solution are the same
print('Output columns are the same:', df['signal'].equals(df2['signal']))
print(df2)

The output shows that the proposed solution finishes in much less time while producing the same signal column. On this data, repeated vectorized passes are far faster than row-by-row for-loop iteration.

Time taken original solution: 6.501 sec
Time taken proposed solution: 0.202 sec
Output columns are the same: True

        ind_1  ind_2  signal  signal_compare  signal_shift1
0          -1     80      -1              -1           -1.0
1           0     14      -1              -1           -1.0
2           0     14      -1              -1           -1.0
3           1     46       0               0           -1.0
4           0     23       0               0            0.0
...       ...    ...     ...             ...            ...
999995      0     70       0               0            0.0
999996     -1     88      -1              -1            0.0
999997      0     73      -1              -1           -1.0
999998     -1     39       0               0           -1.0
999999     -1     83      -1              -1            0.0

[1000000 rows x 5 columns]
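
One thing the timing above does not show is how many vectorized passes were needed. The sketch below is an assumption-laden restatement of the while-loop idea, not the answer's exact code: `propagate_by_passes` is a hypothetical helper that counts the passes that changed something. Each pass extends every run by one row, so the count grows with the longest run of consecutive rows to be filled.

```python
import pandas as pd

def propagate_by_passes(signal, ind_1):
    """Repeat the shift-and-fill step until nothing changes.

    Returns the final signal and the number of passes that changed a row.
    """
    signal = signal.copy()
    passes = 0
    while True:
        prev = signal.shift(1, fill_value=0)  # previous row's signal, 0 for row 0
        fill = (signal == 0) & (ind_1 == 0) & (prev != 0)
        if not fill.any():
            return signal, passes
        signal = signal.mask(fill, prev)      # take prev where fill is True
        passes += 1

# Made-up data: a run of three rows that must inherit the -1 in row 0
ind_1 = pd.Series([-1, 0, 0, 0, 1])
signal = pd.Series([-1, 0, 0, 0, 1])
result, passes = propagate_by_passes(signal, ind_1)
print(result.tolist(), passes)  # [-1, -1, -1, -1, 1] 3
```

On random data the runs are short, so only a few passes are needed; on data with very long runs of `ind_1 == 0` the advantage over a plain loop would likely shrink.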

if row index 3 signal = 0 and ind_1 = 0 in your example, would this row get updated with your code? I need all rows that meet the conditions to be updated — Eric Cepress, Jul 19 '23 at 04:20
Yea, you're right! The 'signal' value in previous rows do change, so I'd need to apply the vectorization a few times, let me explore! — blackraven, Jul 19 '23 at 14:18
I've re-used the DataFrame code from @MariaKozlova (thanks buddy!) — blackraven, Jul 19 '23 at 15:03

score 0 · Answer 2 · answered Jul 19 '23 at 01:19

The shift method seems to do the trick.

I've generated a mock dataset like this:

import numpy as np
import pandas as pd

threshold = 50
df = pd.DataFrame({
    "ind_1": np.random.randint(-1, 2, size=(10, )),
    "ind_2": np.random.randint(0, 100, size=(10, ))
})

print(df)

Output:

   ind_1  ind_2
0     -1     31
1      1     68
2      0     78
3      0     11
4      0     29
5     -1     81
6      0      2
7      0      3
8      1     99
9      0     55

Then shifting and checking all the conditions:

# your signal calculations
df['signal'] = np.where(np.logical_and(df['ind_1'] == 1, df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where(np.logical_and(df['ind_1'] == -1, df['ind_2'] >= 100 - threshold), -1, df['signal'])

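# for-loop replacement (a single pass over the shifted column)
prev_signal = df["signal"].shift(1, fill_value=0)  # fill_value avoids a NaN in row 0
substitute_idxs = (df["signal"] == 0) & (prev_signal != 0) & (df["ind_1"] == 0)
print(df)
print(substitute_idxs)
# assign through .loc; chained indexing may write to a temporary copy
df.loc[substitute_idxs, "signal"] = prev_signal[substitute_idxs]
print(df)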

Output:

   ind_1  ind_2  signal
0     -1     31       0
1      1     68       0
2      0     78       0
3      0     11       0
4      0     29       0
5     -1     81      -1
6      0      2       0
7      0      3       0
8      1     99       0
9      0     55       0

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
8    False
9    False
dtype: bool

   ind_1  ind_2  signal
0     -1     31       0
1      1     68       0
2      0     78       0
3      0     11       0
4      0     29       0
5     -1     81      -1
6      0      2      -1
7      0      3       0
8      1     99       0
9      0     55       0

The first part of the output is the initial dataframe.
The second part marks as True the rows that satisfy all conditions and whose signal needs to be replaced.
The third part is the dataframe with the replaced signals (the signal at index 6 was replaced with the previous value, -1).

index 7 row should also be updated to -1 since it meets the conditions and the previous signal is now -1. — Eric Cepress, Jul 19 '23 at 04:15
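
The chained update described in the comment above can also be expressed without any loop by using a forward fill. This is a sketch of a different technique from the one in this answer, and `propagate_signal` is a hypothetical helper name. It assumes a non-empty DataFrame with the `signal` and `ind_1` columns from the question. The idea: every row with `signal == 0` and `ind_1 == 0` (except row 0) ends up equal to the row above it, so those rows can be blanked out and forward-filled.

```python
import numpy as np
import pandas as pd

def propagate_signal(df):
    """Return the signal column with non-zero signals carried forward."""
    carry = (df['signal'] == 0) & (df['ind_1'] == 0)
    carry.iloc[0] = False  # row 0 has no previous row, so it keeps its value
    # Blank out the rows that inherit, then copy the last kept value downward
    return df['signal'].mask(carry).ffill().astype(int)

def reference_loop(df):
    """The loop from the question, run on a copy, used only for comparison."""
    out = df.copy()
    for i in range(1, len(out)):
        if out.at[i, 'signal'] == 0 and out.at[i - 1, 'signal'] != 0:
            if out.at[i, 'ind_1'] == 0:
                out.at[i, 'signal'] = out.at[i - 1, 'signal']
    return out['signal']

# Random check data; the seed, size and threshold are arbitrary choices
rng = np.random.default_rng(0)
threshold = 50
df = pd.DataFrame({
    'ind_1': rng.integers(-1, 2, size=1000),
    'ind_2': rng.integers(0, 101, size=1000),
})
df['signal'] = np.where((df['ind_1'] == 1) & (df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where((df['ind_1'] == -1) & (df['ind_2'] >= 100 - threshold), -1, df['signal'])

print((propagate_signal(df) == reference_loop(df)).all())  # True
```

A row whose previous signal is 0 inherits 0, which is the same as leaving it unchanged, so the `!= 0` test from the loop does not need to appear explicitly.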

score 0 · Answer 3 · answered Jul 19 '23 at 06:45

This is the best solution I have come up with:

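# Work on plain NumPy arrays; element access is much cheaper than df.at
signal_values = df['signal'].to_numpy(copy=True)
ind_1_values = df['ind_1'].to_numpy()
# Seed from row 0 (not 0), so that row 1 sees the real previous signal
prev_signal = signal_values[0] if len(df) else 0

for i in range(1, len(df)):
    if signal_values[i] == 0 and prev_signal != 0:
        if ind_1_values[i] == 0:
            signal_values[i] = prev_signal
    prev_signal = signal_values[i]

df['signal'] = signal_values    # write the result back to the DataFrame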

Eric Cepress

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jul 21 '23 at 20:57
