
I'm wondering if I'm doing this the right way. I'm new to Python and tried to figure this out as best I could, but now that I've almost completed my project, part of it is painfully slow.

I have pulled down daily OHLC bars, filtered through them to find the gappers, and then created a class that goes over the minute data for those daily gappers and returns the important information I use later in my backtests, e.g. whether we've hit the 200 EMA in the pre-market.
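
The helper functions like HIT_200_EMA and GET_TIME_PERIOD_DATA aren't shown here; as a rough sketch of the idea only (not the exact helper code), the pre-market 200 EMA check amounts to something like this:

from datetime import time

# sketch of the idea only, not the actual helper:
# did the pre-market high ever reach a 200-period EMA of the minute closes?
def hit_200_ema_premarket(minute_bars):
    # assumes a 'time' column already converted to datetime.time
    pre_market = minute_bars[minute_bars['time'] < time(9, 30)]
    ema_200 = pre_market['close'].ewm(span=200, adjust=False).mean()
    return bool((pre_market['high'] >= ema_200).any())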

Here's what my code looks like:

# get the dates for our gaps
import os.path
import glob
import numpy as np
import pandas as pd
from pathlib import Path

folder = "daily_bars_filtered/*.csv"
df_gapper_list = []
df_intraday_analysis = []
# loop through the daily gappers
for fname in glob.glob(folder)[:13]:
    ticker = Path(fname).stem
    df = pd.read_csv(fname)
    df['ticker'] = ticker
    df_gapper_list.append(df)
    print(f'downloading {ticker}')

    # get the intraday bars data for all of this ticker's dates
    file_ = 'intraday_bars_gapped_new/{}.csv'.format(ticker)
    df_minute_bars = pd.read_csv(file_)

    # for the current stock's daily gappers, analyse the data and return it (fades, ohlc, market periods, etc)
    for index, row in df.iterrows():
        session_data = SESSION_DATA(
            # pass in the minute data
            df_minute_bars,
            # pass in the daily data and ticker
            row['open'],
            row['high'],
            row['low'],
            row['close'],
            row['date'],
            ticker,
            row['previous close'],
            row['volume']).intraday_data

        df_intraday_analysis.append(session_data)


    final_df = pd.concat(df_intraday_analysis,ignore_index=True)
    display(final_df)
    print(f'length of final_df is {len(final_df)}')
    final_df.to_csv('mikeys-spreadsheet2222.csv', index=False)

And here's what my class looks like:

import pandas as pd
from datetime import datetime, time
from IPython.display import display
import math

class SESSION_DATA:
    def __init__(self, minute_data, open, high, low, close, date, ticker, previous_close, volume):
        self.minute_data = minute_data
        self.date = date
        self.ticker = ticker
        self.intraday_data = []
        self.open = open
        self.high = high
        self.low = low
        self.close = close
        self.previous_close = previous_close
        self.volume = volume
        df = self.minute_data
        
        df_current_day = df[df['date'] == self.date].copy()
        df_current_day['time'] = pd.to_datetime(df_current_day['time']).dt.time

        self.after_hours_high = GET_TIME_PERIOD_DATA('after_hours', df_current_day).high
        self.after_hours_runner = bigger_smaller(self.after_hours_high, self.previous_close)
        self.pre_market_high = GET_TIME_PERIOD_DATA('pre_market', df_current_day).high
        self.pre_market_high_time = GET_TIME_PERIOD_DATA('pre_market', df_current_day).high_time
        self.new_gapper = new_gapper(self.after_hours_runner, self.early_pre_market_runner)
        self.spike = abs(self.high - self.open)
        df_intraday_data = pd.DataFrame({
            'date': self.date,
            'ticker': self.ticker,
            'open': self.open,
            'high': self.high,
            'low': self.low,
            'close': self.close,
            'prev close': self.previous_close,
            'volume': self.volume,
            'PM hi': self.pre_market_high,
            'PM hi time': self.pre_market_high_time,
            'PM 200 ema hit': HIT_200_EMA('pre_market', df_current_day).hit_200_ema,
            'New gapper': self.new_gapper,
            'Spike': self.spike,
            'Pop over 10%': pop_over_10(self.spike),
            }, index=[0])

        self.intraday_data = df_intraday_data

Is there a better way of achieving what I am doing, maybe without the use of iterrows, or by using something like numpy?

  • iterrows() is very slow. See https://stackoverflow.com/a/59413206/543969 – CodeMonkey Jun 10 '22 at 21:07
  • Thank you CodeMonkey that gives me something to go on – a7dc Jun 10 '22 at 21:09
  • Why would you create a class for each of the entries in the dataframe? Looks pretty inefficient to me, especially to add attributes. You can add attributes directly in the dataframe – Zaero Divide Jun 10 '22 at 21:18
  • Thanks Zaero this also gives me something to explore – a7dc Jun 10 '22 at 21:26
  • I think for your application it may be better to have a "dataframe manipulator class" that will implement all the fancy methods, but the "tags" are in the underlying dataframe. I would advise you to check [`eval`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.eval.html) and [`query`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html) as they are typically faster than other df operations. And as much as you can, avoid using for/iterrows/lambdas and custom functions. Good luck! – Zaero Divide Jun 10 '22 at 21:44
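
To make the dataframe-level approach suggested in the comments concrete, here is a minimal sketch against one of the per-ticker daily frames from the question; the gap filter and gap_pct formula are only illustrative assumptions, not the questioner's actual logic:

import glob
import pandas as pd

# any of the per-ticker daily frames from the question (same columns, incl. 'previous close')
fname = glob.glob('daily_bars_filtered/*.csv')[0]
df = pd.read_csv(fname)

# derive values as whole columns instead of building one object per row
df['spike'] = (df['high'] - df['open']).abs()

# query() filters rows with a readable expression; backticks handle the space in 'previous close'
gapped_up = df.query('open > `previous close`')

# eval() can add derived columns in a single pass
df = df.eval('gap_pct = (open - `previous close`) / `previous close` * 100')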

1 Answer


Thanks to CodeMonkey for pointing me in the right direction. I haven't measured the exact speed difference, but it's huge, so thank you. I will look into trying Zaero's suggestions in the future when I have time:

# get the dates for our gaps
import os.path
import glob
import numpy as np
import pandas as pd
from pathlib import Path
from IPython.display import display
# import time


def SESSION_TEST(open, high, low, close, date, ticker, previous, volume, df):
    date = pd.to_datetime(date).strftime('%Y-%m-%d')
    current_day = df[df['date'] == date]

    session_data = pd.DataFrame({
        'date': date,
        'ticker': ticker,
        'open': open,
        'high': high,
        'low': low,
        'close': close,
        'prev close': previous,
        'volume': volume,
        'hit 200 ema': HIT_200_EMA('pre_market', current_day),
        }, index=[0])

    return session_data


folder = "daily_bars_filtered/*.csv"
df_intraday_analysis = []
# loop through the daily gappers
for fname in glob.glob(folder)[216:]:
    ticker = Path(fname).stem
    df = pd.read_csv(fname, parse_dates=['date'])
    df['ticker'] = ticker
    print(f'getting {ticker}')


    # get the intraday bars data for all of this ticker's dates
    file_ = 'intraday_bars_gapped_new/{}.csv'.format(ticker)

    if os.path.exists(file_):
        print(f'the {file_} exists')
        df_minute_bars = pd.read_csv(file_, parse_dates=['date', 'datetime'])
        for row in df.to_dict(orient='records'):
            session_data = SESSION_TEST(
                # pass in the daily data and ticker
                row['open'],
                row['high'],
                row['low'],
                row['close'],
                row['date'],
                ticker,
                row['previous close'],
                row['volume'],
                # pass in the minute data
                df_minute_bars)

            df_intraday_analysis.append(session_data)
    else:
        print(f'No such file or directory: {file_}')

# concatenate and write the spreadsheet once, after the loop, rather than on every row
final_df = pd.concat(df_intraday_analysis, ignore_index=True)
display(final_df)
final_df.to_csv('mikeys-spreadsheet.csv', index=False)
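
For anyone who wants to actually measure the difference mentioned above, a quick-and-dirty harness along these lines works, with df being any of the per-ticker daily frames loaded in the loop:

import time

def time_rows(label, fn):
    # crude wall-clock timing of a single pass over the rows
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.3f}s')

time_rows('iterrows', lambda: [row for _, row in df.iterrows()])
time_rows('to_dict ', lambda: [row for row in df.to_dict(orient='records')])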