So I know you're never supposed to iterate over a Pandas DataFrame, but I can't find another way around this problem.
I have a bunch of different time series, say they're end-of-day stock prices. They're in a DataFrame like this:
      Ticker  Price
    0    AAA   10
    1    AAA   11
    2    AAA   10.5
    3    BBB   100
    4    BBB   110
    5    CCC   60
    etc.
For each Ticker, I want to take a variety of models and train them on successively larger batches of data. Specifically, I want to take a model, train it on day1 data, and predict day2. Then train the same model on day1 and day2, predict day3, etc. In other words, for each day N, I want to slice everything up to the day before, [day0:dayN-1], train on that subset, and predict day N.
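To make the slicing concrete, here is a minimal sketch of the expanding-window scheme on the toy data above. The helper name `expanding_windows` is just something I made up for illustration, and no model is trained here; it only shows which rows would go into each training set.

```python
import pandas as pd

def expanding_windows(df_ticker):
    """Yield (history, target_label) pairs for one ticker.

    history is every row before day N (a positional slice),
    target_label is the index label of day N itself.
    """
    for pos in range(1, len(df_ticker)):
        yield df_ticker.iloc[:pos], df_ticker.index[pos]

# Toy data matching the table above
df = pd.DataFrame({
    "Ticker": ["AAA", "AAA", "AAA", "BBB", "BBB", "CCC"],
    "Price": [10, 11, 10.5, 100, 110, 60],
})

for ticker, df_ticker in df.groupby("Ticker"):
    for history, target in expanding_windows(df_ticker):
        print(ticker, "train on", history["Price"].tolist(), "-> predict row", target)
```

A ticker with a single row (CCC here) produces no windows, since there is no earlier day to train on.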
Essentially I'm implementing sklearn's TimeSeriesSplit, except I'm doing it myself because the models I'm training aren't in sklearn (for example, one model is Prophet).
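For comparison, this is roughly the split pattern I'm trying to reproduce. As far as I can tell, TimeSeriesSplit only yields index arrays, so the sketch below just prints them for a made-up price series of one ticker. Setting `n_splits` to the series length minus one with `test_size=1` is my assumption about how to get one split per day.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Made-up prices for a single ticker
prices = np.array([10.0, 11.0, 10.5, 12.0])

# One split per day after the first: train on everything before, test on one day
splitter = TimeSeriesSplit(n_splits=len(prices) - 1, test_size=1)
splits = [(train.tolist(), test.tolist()) for train, test in splitter.split(prices)]

for train_idx, test_idx in splits:
    print("train on positions", train_idx, "predict position", test_idx)
```

Note that TimeSeriesSplit needs at least two splits, so this would not work as-is for a ticker with only two rows.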
The idea is I try a bunch of models on a bunch of different Tickers, then I see which models work well for which Tickers.
So my basic code for running one model on all my data looks like:
    import pandas as pd

    def make_predictions(df):
        res = pd.DataFrame()
        for ticker in df['Ticker'].unique():
            # copy so that writing 'preds' below does not hit a view of df
            df_ticker = df[df['Ticker'] == ticker].copy()
            for pos, idx in enumerate(df_ticker.index):
                X = df_ticker.iloc[:pos]        # rows before the current day (by position, not label)
                X = do_preparations(X)          # do some processing to prepare the data
                m = train_model(X)              # train the model
                forecast = forecast_model(m)    # predict one week (separate helper, not the outer function)
                df_ticker.loc[idx, 'preds'] = forecast['y'][0]
            res = pd.concat([res, df_ticker])
        return res
But my code runs super slow. Can I speed this up somehow? I can't figure out how I would use .apply() or any of the other common anti-iterating techniques.