Efficiency improvement when looping over a Dataframe

Question

I have a python Dataframe, which consists of 2000 rows (each row represents a day). Then, I tried to use a for-loop to loop over the Dataframe[0:252] and calculate some signals and save to a list, and then I need to do the same on Dataframe[1:253], and on Dataframe[2:254] until Dataframe[1746:2000].

This took me around 15 minutes on one file which consists of 2000 rows and I have 1000 copies of that. Does anyone give me some hints on how I can perform better?

Thank you.

Some codes samples are as follow:

data = pd.read_excel("17 HK Stocks in different sheets.xlsx",sheet_name=Sheet_Name,engine='openpyxl',parse_dates=True,index_col='Date')

# Start to divide the data into each year and loop over
for adjusting_factor in range(len(data)):
    data = pd.read_excel("17 HK Stocks in different sheets.xlsx",sheet_name=Sheet_Name,engine='openpyxl',parse_dates=True,index_col='Date')
    data['Date'] = pd.to_datetime(data.index)
    data['Date'] = data['Date'].apply(mpl_dates.date2num)
    data = data.loc[:,['Date', 'Open', 'High', 'Low', 'Close', 'Volume']]
    
    
    data['Indication_Close_Rule'] = 0
    data['Resistance_Validation'] = 0
    data['Higher_Contracting_Range'] = 0
    #data['Volume_Rolling_50'] = data['Volume'].rolling(50).mean() 
    #data['Volume_Validation'] = 0
    data['VCP_Signal'] = 0
    data['VCP_Signal_Change'] = 0
    total_numbers_signal = 0


    
    # This is the rolling of the dataframe in a range of 252 days
    data = data[(0 + adjusting_factor):252+adjusting_factor]  
    
    if len(data) < 252:
        break


    for j in range(4,data.shape[0]):


        if data['High'][j] < data['High'][j-1] and data['High'][j-1] < data['High'][j-2] and data['High'][j-2] > data['High'][j-3] and data['High'][j-3] > data['High'][j-4] :

            if len(zigzag_list) == 0:
                date_carrier_high = str(data.index.values[j-2])            
                zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))

            elif len(zigzag_list) != 0:
                if zigzag_list[-1][2] == 'L' and zigzag_list[-1][0] < data['High'][j-2]:
                    date_carrier_high = str(data.index.values[j-2])
                    zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                    Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))

                elif zigzag_list[-1][2] == 'L' and zigzag_list[-1][0] > data['High'][j-2]:
                    pass

                elif zigzag_list[-1][2] == 'H' and zigzag_list[-1][0] > data['High'][j-2]:
                    pass

                elif zigzag_list[-1][2] == 'H' and zigzag_list[-1][0] < data['High'][j-2]:
                    zigzag_list = zigzag_list[:-1]
                    date_carrier_high = str(data.index.values[j-2])
                    zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                    Current_Three_Points_For_Top = Current_Three_Points_For_Top[:-1]
                    Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))


        elif data['High'][j] > data['High'][j-1] and data['High'][j-1] < data['High'][j-2] and data['High'][j-2] > data['High'][j-3] and data['High'][j-3] > data['High'][j-4] :                                           

            if len(zigzag_list) == 0:
                date_carrier_high = str(data.index.values[j-2])            
                zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))

            elif len(zigzag_list) != 0:
                if zigzag_list[-1][2] == 'L' and zigzag_list[-1][0] < data['High'][j-2]:
                    date_carrier_high = str(data.index.values[j-2])
                    zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                    Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))

                elif zigzag_list[-1][2] == 'L' and zigzag_list[-1][0] > data['High'][j-2]:
                    pass

                elif zigzag_list[-1][2] == 'H' and zigzag_list[-1][0] > data['High'][j-2]:
                    pass

                elif zigzag_list[-1][2] == 'H' and zigzag_list[-1][0] < data['High'][j-2]:
                    zigzag_list = zigzag_list[:-1]
                    date_carrier_high = str(data.index.values[j-2])
                    zigzag_list.append((data['High'][j-2],j-2,'H',date_carrier_high[0:10]))
                    Current_Three_Points_For_Top = Current_Three_Points_For_Top[:-1]
                    Current_Three_Points_For_Top.append((data['High'][j-2],j-2+adjusting_factor,'H',date_carrier_high[0:10]))

can you give us some code? Becuse now you are asking for a complete new code, not for help on improving your code. — Matthijs990, Dec 02 '20 at 09:18
What sort of calculations are you doing? We need to see some sort of example if you want help. If you're doing something like `result = sum(df[0:252])`, then for the next iteration why not do `result = result - df[0] + df[252]`? It cuts out a whole load of calculations — JolonB, Dec 02 '20 at 09:18
what about using [pandas.DataFrame.rolling()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html)? — buran, Dec 02 '20 at 09:20
Thank you @Matthijs990, please see the example above I just editted. — Jacky Law, Dec 02 '20 at 09:39
Thank you @JolonB, please see the example above I just editted. — Jacky Law, Dec 02 '20 at 09:39
There are a few things you could do to just tidy up your code (this isn't really related to the question though :)). Firstly, get rid of `if` statements that just lead to a pass. Also, you don't need `elif len(zigzag_list) != 0:`. Just use `else:`. There's no reason to evaluate the opposite condition from what you already evaluated for the if statement — JolonB, Dec 02 '20 at 09:45
@buran This methods seems do not work if I need to perform some custom computation? — Jacky Law, Dec 02 '20 at 09:55
look at https://stackoverflow.com/questions/60736556/pandas-rolling-apply-using-multiple-columns for rolling calculations which use multiple columns — buran, Dec 02 '20 at 10:22

Mark · Answer 1 · 2020-12-02T09:59:12.447

0

As someone suggested in the comments, use rolling() method on dataframe.

Example:

import io
import pandas as pd
import numpy as np

# Create dataframe
s = """
A B C
1 5 7
5 9 2
3 8 7
2 8 3
2 6 4
"""
df = pd.read_csv(io.StringIO(s), delimiter=' ')


rolled = df.rolling(window=2)

# your custom function to apply to the window
def f(arr):
    # change this
    return np.sum(arr)

print(df)
print(rolled.apply(f).dropna())

Output:

   A  B  C
0  1  5  7
1  5  9  2
2  3  8  7
3  2  8  3
4  2  6  4

     A     B     C
1  6.0  14.0   9.0
2  8.0  17.0   9.0
3  5.0  16.0  10.0
4  4.0  14.0   7.0

edited Dec 02 '20 at 09:59

answered Dec 02 '20 at 09:39

Mark

532
2
6

Yes, `df.rolling(2).sum()` is correct, for a simple sum. However, OP mentions "calculate some signals" - how would this operate if OP needed to use a custom function? – S3DEV Dec 02 '20 at 09:47
I was gonna ask the same too. – Jacky Law Dec 02 '20 at 09:54
@JackyLaw I edited my answer to include `apply`. Is that what you meant? – Mark Dec 02 '20 at 10:04
Mark - Apologies, I should have been more specific. Try a more complex calc in the function - sum will rarely throw errors. Try exponential, log, sin/cos ... something still simple, but more complex. – S3DEV Dec 02 '20 at 10:10
@S3DEV you mean window_shape? – Mark Dec 02 '20 at 10:14
Thank you. But would you mind explain why using rolling function is more efficient than data = data[0:252] ? – Jacky Law Dec 03 '20 at 04:51

Efficiency improvement when looping over a Dataframe

1 Answers1