I am aware there are some other similar questions but I haven't found a timeit
test with a fairly large dataset. I have a ~1.6M rows DF , and I want to use the fastest way to assign a column ['stag'] a 0-1 value depending on the index weekday and hour.
Is there a faster approach I am missing ? suggestions ?
import timeit
import statistics
...
...
df_backtest['weekday'] = df_backtest.index.weekday
def approach_1():
df_backtest["stag"]=1
df_backtest.loc[(df_backtest.index.hour>=2) & (df_backtest.index.hour<=10),"stag" ]=0
df_backtest.loc[df_backtest.index.strftime('%w')==6,"stag" ]=0
df_backtest.loc[df_backtest.index.strftime('%w')==0,"stag" ]=0
def approach_2():
df_backtest['stag'] = 1
df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
df_backtest.loc[df_backtest.index.strftime('%w').isin(['6', '0']), 'stag'] = 0
def approach_3():
df_backtest['stag'] = 1
df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
df_backtest.loc[df_backtest['weekday'].isin(['6', '0']), 'stag'] = 0
def approach_4():
df_backtest['stag'] = 1
df_backtest['stag'] = df_backtest['stag'].where((df_backtest.index.hour < 2) | (df_backtest.index.hour > 10))
df_backtest['stag'] = df_backtest['stag'].where(~df_backtest['weekday'].isin(['6', '0']))
num_repeats = 10
num_loops = 5
print(f'Approach 1: {statistics.mean(timeit.repeat(approach_1, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 2: {statistics.mean(timeit.repeat(approach_2, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 3: {statistics.mean(timeit.repeat(approach_3, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 4: {statistics.mean(timeit.repeat(approach_4, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print('Shape of DF:',df_backtest.shape)
output:
Approach 1: 4.617 seconds per loop
Approach 2: 2.256 seconds per loop
Approach 3: 0.087 seconds per loop
Approach 4: 0.106 seconds per loop
Shape of DF: (1605144, 7)
Hence, it seems approach_3
is the fastest, I assume because it does not use the overhead of 'where' nor managing strings. Any other approach is welcome. Thanks