The fastest way to manipulate data in a dataframe is through vectorization.
Let me explain using below code for 1,000,000 records:
import pandas as pd
import numpy as np
from time import time
df = pd.DataFrame({
'ind_1': np.random.randint(-1, 2, size=(1000000, )),
'ind_2': np.random.randint(0, 101, size=(1000000, ))
})
threshold = 40
df['signal'] = np.where(np.logical_and(df['ind_1'] == 1, df['ind_2'] <= threshold), 1, 0)
df['signal'] = np.where(np.logical_and(df['ind_1'] == -1, df['ind_2'] >= 100 - threshold), -1, df['signal'])
df2 = df.copy() #duplicate dataframe for comparison
I'm using df
to measure the time taken for the original solution.
And the duplicated df2
to measure time taken for my proposed solution.
In my proposed solution, I'm using df2['signal_shift1'] = df2['signal'].shift(1)
to move the signal
column down 1 row, so that the record can be compared across the same row.
Then your original conditions:
A: df.at[i, 'signal'] == 0
and
B: df.at[i - 1, 'signal'] != 0
and
C: df.at[i, 'ind_1'] == 0
becomes this, comparable on the same row:
A&C: df2['signal'].abs() + df2['ind_1'].abs() == 0
and
B: df2['signal_shift1'] != 0
Note that I've combined conditions A==0 and C==0 to become abs(A)+abs(C)==0
#original solution (using df)
ti = time()
for i in range(1, len(df)):
if df.at[i, 'signal'] == 0 and df.at[i - 1, 'signal'] != 0:
if df.at[i, 'ind_1'] == 0:
df.at[i, 'signal'] = df.at[i - 1, 'signal']
print('Time taken original solution: {:.3f} sec'.format(time() - ti))
#proposed solution (using df2)
ti = time()
df2['signal_compare'] = 0 #initialize break condition
while not df2['signal_compare'].equals(df2['signal']):
df2['signal_compare'] = df2['signal'].copy() #condition to break while-loop
df2['signal_shift1'] = df2['signal'].shift(1)
df2.at[0, 'signal_shift1'] = df2.at[0, 'signal'] #to remove null value after .shift(1)
df2['signal'] = np.where(np.logical_and(df2['signal'].abs() + df2['ind_1'].abs() == 0, df2['signal_shift1'] != 0), df2['signal_shift1'], df2['signal']).astype('int')
print('Time taken proposed solution: {:.3f} sec'.format(time() - ti))
#check if original solution and proposed solution are the same
print('Output columns are the same:', df['signal'].equals(df2['signal']))
print(df2)
The output shows that the proposed solution is completed within a shorter time, while having the same column signal
results. It's very obvious that vectorization is much faster than for-loop iteration!
Time taken original solution: 6.501 sec
Time taken proposed solution: 0.202 sec
Output columns are the same: True
ind_1 ind_2 signal signal_compare signal_shift1
0 -1 80 -1 -1 -1.0
1 0 14 -1 -1 -1.0
2 0 14 -1 -1 -1.0
3 1 46 0 0 -1.0
4 0 23 0 0 0.0
... ... ... ... ... ...
999995 0 70 0 0 0.0
999996 -1 88 -1 -1 0.0
999997 0 73 -1 -1 -1.0
999998 -1 39 0 0 -1.0
999999 -1 83 -1 -1 0.0
[1000000 rows x 5 columns]