So I know you're never supposed to iterate over a Pandas DataFrame, but I can't find another way around this problem.

I have a bunch of different time series, say they're end-of-day stock prices. They're in a DataFrame like this:

   ticker  price
0   AAA    10
1   AAA    11
2   AAA    10.5
3   BBB    100
4   BBB    110
5   CCC    60
etc.

For each ticker, I want to take a variety of models and train them on successively larger batches of data. Specifically, I want to take a model, train it on day 1's data, and predict day 2; train the same model on days 1 and 2, and predict day 3; and so on. For each day, I slice everything up to the day before ([day0:dayN-1]), train on that subset, and predict the next day.

Essentially I'm implementing sklearn's TimeSeriesSplit, except I'm doing it myself because the models I'm training aren't in sklearn (for example, one model is Prophet).
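
For reference, this is the kind of expanding-window split I mean, shown with sklearn's TimeSeriesSplit on a toy array rather than my real data:

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

prices = np.array([10, 11, 10.5, 10.8, 11.2]).reshape(-1, 1)   # toy "daily" prices

tscv = TimeSeriesSplit(n_splits=4)        # one split per predicted day
for train_idx, test_idx in tscv.split(prices):
    print("train on days", train_idx, "-> predict day", test_idx)
# prints:
#   train on days [0] -> predict day [1]
#   train on days [0 1] -> predict day [2]
#   train on days [0 1 2] -> predict day [3]
#   train on days [0 1 2 3] -> predict day [4]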

The idea is to try a bunch of models on a bunch of different tickers and then see which models work well for which tickers.

So my basic code for running one model on all my data looks like:

import pandas as pd


def make_predictions(df):

    res = pd.DataFrame()

    for ticker in df.ticker.unique():
        df_ticker = df[df['ticker'] == ticker]

        for i,_ in df_ticker.iterrows():
            X = df_ticker[0:i]
            X = do_preparations(X)           # do some processing to prepare the data
            m = train_model(X)               # train the model
            forecast = predict_model(m)      # predict one week

            df_ticker.loc[i,'preds'] = forecast['y'][0]

        res = pd.concat([res,df_ticker])

    return res

But my code runs super slow. Can I speed this up somehow? I can't figure out how I would use .apply() or any of the other common anti-iteration techniques.

  • Welcome to SO! Your options for speeding things up depend to some extent on your functions. For example, could `do_preparations()` be applied to the whole dataframe in advance? And do your models allow for iteratively adding data, rather than re-computing the whole thing each time? – ASGM Feb 06 '20 at 22:08
  • For one, `.iterrows()` is notoriously sluggish, and so is repeatedly appending and concatenating DataFrames. I do agree with @ASGM, I think we're going to need more information. – AMC Feb 06 '20 at 23:42
  • Hmm, I probably could move `do_preparations()` to the outer loop. Unfortunately it removes the ticker data, which prevents me from looping over the appropriate rows. As for re-computing the model, I structured it that way to prevent future data from leaking into any given training run. Are there functions other than `.iterrows()` that might be faster? Are there other ways to build the dataframe besides repeatedly appending and concatenating? Either would be very appealing. Thanks for your help! – DRJ Feb 08 '20 at 00:40
  • It's still pretty hard to give advice without knowing what's being done to the rows. As written it's hard to know what the code is doing: `df_bdot` isn't defined, for example. Can you just re-write `do_preparations()` so it doesn't remove ticker data? Do you understand in general how to write array-wise (i.e. apply-compatible) functions? Have you considered `groupby()`? – ASGM Mar 06 '20 at 20:16
  • sorry, that should read "df_ticker". I've edited the code to reflect it. – DRJ Mar 06 '20 at 20:24
  • The problem is that the `train_model()` function cannot accept a dataframe with ticker data. It can only accept an n x 2 input where one column is datetime and the other is the timeseries value. As I understand it, apply-compatible functions are applied row-by-row, and I think I know how to do those. My issue is that I want to apply the function to monotonically increasing subsets of the dataframe. So for each row, apply the function to all previous rows that meet my criteria (in this case, ticker). – DRJ Mar 06 '20 at 20:30

1 Answer


Consider several items:

  • First, avoid the quadratic copying that comes from calling pd.concat inside a loop. Instead, build a list or dict of data frames and concatenate them once outside the loop (see the short sketch after this list).
  • Second, avoid DataFrame.iterrows, since you only use the index label i. Instead, iterate over the index directly.
  • Third, for compactness, avoid unique() followed by subsetting with [...]. Instead, use groupby() inside a dictionary or list comprehension, which can be slightly faster than the list.append approach; since you have multiple processing steps, wrap them in an inner function.
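
A minimal sketch of that first point, where process() is just a placeholder for whatever per-ticker work you do:

# Slow: every pd.concat re-copies everything accumulated so far
res = pd.DataFrame()
for ticker, g in df.groupby('ticker'):
    res = pd.concat([res, process(g)])        # process() = placeholder per-ticker work

# Faster: collect the pieces, then concatenate once outside the loop
pieces = [process(g) for _, g in df.groupby('ticker')]
res = pd.concat(pieces, ignore_index=True)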

The inner loop may be unavoidable, since you really are training a separate model for each day.

def make_predictions(df):

    def proc_model(sub_df):
        sub_df = sub_df.copy()               # work on a copy of the group

        for i in sub_df.index:
            X = sub_df.loc[:i].iloc[:-1]     # rows strictly before day i ([day0:dayN-1])
            if X.empty:                      # nothing to train on for the first day
                continue
            X = do_preparations(X)           # do some processing to prepare the data
            m = train_model(X)               # train the model
            forecast = predict_model(m)      # predict one week
            sub_df.loc[i, 'preds'] = forecast['y'][0]

        return sub_df

    # BUILD DICTIONARY OF DATA FRAMES, ONE PER TICKER
    df_dict = {t: proc_model(g) for t, g in df.groupby('ticker')}

    # CONCATENATE DATA FRAMES ONCE, OUTSIDE ANY LOOP
    res = pd.concat(df_dict.values(), ignore_index=True)

    return res
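
If you want to smoke-test the refactor without your real models, stub the helpers with placeholders first; everything below (the stub logic, the predict_model name, the toy frame) is made up purely for illustration:

import pandas as pd

def do_preparations(X):          # placeholder: return the data unchanged
    return X

def train_model(X):              # placeholder "model": just remember the last price
    return X['price'].iloc[-1]

def predict_model(m):            # placeholder forecast: repeat the last price
    return {'y': [m]}

df = pd.DataFrame({'ticker': ['AAA', 'AAA', 'AAA', 'BBB', 'BBB', 'CCC'],
                   'price': [10, 11, 10.5, 100, 110, 60]})

print(make_predictions(df))      # 'preds' stays NaN on each ticker's first day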