0

I am aware there are some other similar questions but I haven't found a timeit test with a fairly large dataset. I have a ~1.6M rows DF , and I want to use the fastest way to assign a column ['stag'] a 0-1 value depending on the index weekday and hour.

Is there a faster approach I am missing ? suggestions ?

import timeit
import statistics

...
...

df_backtest['weekday'] = df_backtest.index.weekday

def approach_1():
    df_backtest["stag"]=1
    df_backtest.loc[(df_backtest.index.hour>=2) & (df_backtest.index.hour<=10),"stag" ]=0
    df_backtest.loc[df_backtest.index.strftime('%w')==6,"stag" ]=0
    df_backtest.loc[df_backtest.index.strftime('%w')==0,"stag" ]=0


def approach_2():
    df_backtest['stag'] = 1
    df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
    df_backtest.loc[df_backtest.index.strftime('%w').isin(['6', '0']), 'stag'] = 0
    

def approach_3():
    df_backtest['stag'] = 1
    df_backtest.loc[df_backtest.index.hour.isin(range(2, 11)), 'stag'] = 0
    df_backtest.loc[df_backtest['weekday'].isin(['6', '0']), 'stag'] = 0


def approach_4():
    df_backtest['stag'] = 1
    df_backtest['stag'] = df_backtest['stag'].where((df_backtest.index.hour < 2) | (df_backtest.index.hour > 10))
    df_backtest['stag'] = df_backtest['stag'].where(~df_backtest['weekday'].isin(['6', '0']))

num_repeats = 10
num_loops = 5

print(f'Approach 1: {statistics.mean(timeit.repeat(approach_1, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 2: {statistics.mean(timeit.repeat(approach_2, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 3: {statistics.mean(timeit.repeat(approach_3, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print(f'Approach 4: {statistics.mean(timeit.repeat(approach_4, number=num_loops, repeat=num_repeats)) / num_loops} seconds per loop')
print('Shape of DF:',df_backtest.shape)

output:

Approach 1: 4.617 seconds per loop
Approach 2: 2.256 seconds per loop
Approach 3: 0.087 seconds per loop
Approach 4: 0.106 seconds per loop
Shape of DF: (1605144, 7)

Hence, it seems approach_3 is the fastest, I assume because it does not use the overhead of 'where' nor managing strings. Any other approach is welcome. Thanks

Lorenzo Bassetti
  • 795
  • 10
  • 15
  • I am not sure if [a minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) is possible to be shared here, but could you please tell us what `df_backtest` contains i.e., the columns that are involved in computation and their data types? – medium-dimensional Jan 13 '23 at 05:02
  • Please check [How do I assign values based on multiple conditions on existing columns?](https://stackoverflow.com/questions/30631841/how-do-i-assign-values-based-on-multiple-conditions-for-existing-columns/73728391#73728391). – medium-dimensional Jan 13 '23 at 05:05

0 Answers0