Pandas: Timeseries using predictions as ground truth

Question

I'm learning about time series and am trying to predict closing stock price for the next two weeks, given the data I already have (about a year).

I've created 7 lag features using Pandas shift, so I have features t-7, t-6, ..., t-1 and the current day's closing stock price for my whole DataFrame, df. I've made a test_df which is just the last two weeks of data. test_df has the true values for each of its rows' lagged features.

I want to mimic predicting future values by limiting myself to values from my training set (everything in df before the last two weeks) and my predictions.

So I was going to do something like:

# for each row in test_df
    # prediction = model.predict(row)
    # row["t"] = prediction

I think this is close, but it doesn't fix other lagged features like t-1, t-2, ..., t-7. I need to do this:

row 2, t = prediction for row 1
row 2, t-1 = t for row 1
...
row 2, t-i = t-(i-1) for row 1

And I would repeat this for all rows in my test_df.

I could do this by writing my own function, but I'm wondering if there's a way to take advantage of Pandas to do this more easily.

Edit for clarity:

Suppose I'm looking at my first test row. I don't have the closing_price, so I use my model to predict based on the lagged features. Before prediction, my df looks like this:

  closing_price  t-1  t-2  t-3  t-4  t-5
0          None    7    6    5    4    3

Suppose my prediction for closing_price is 15. Then my updated DataFrame should look like this:

   closing_price   t-1  t-2  t-3  t-4  t-5
0           15.0   7.0  6.0  5.0  4.0  3.0
1            NaN  15.0  7.0  6.0  5.0  4.0

Thanks!

DataFrames have a `.shift` method that is quite handy for situations like this and lets you avoid writing loops — Paul H, Mar 08 '18 at 20:12
@PaulH Thanks, I used `.shift` to create my lagged features initially. Not 100% sure how I would use it here though? — anon_swe, Mar 08 '18 at 20:14
You need to add more detail to your question, then. It's currently unclear what you're trying to achieve. You should include 10-15 rows and a few columns of your existing DataFrame, and your desired output (computed and typed out by hand, if needed) — Paul H, Mar 08 '18 at 20:21
@PaulH Thanks for the guidance. Just updated to make it clearer. Let me know if that helps! — anon_swe, Mar 08 '18 at 21:10

Seth Rothschild · Answer 1 · 2018-03-09T04:46:02.960

Edited: So you won't actually need time series split for this at all, since you're only trying to predict the value for one row. It seems you know how to create the shifted dataframe, so suppose you've stored your train data in a dataframe df where the 'closing_price' element of the last row is None. You'll use:

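# reg is any scikit-learn regressor, e.g. LinearRegression()
# Use drop() rather than pop() so slices of df are not modified in place.
Xtrain = df.iloc[:-1].drop(columns='closing_price')
ytrain = df['closing_price'].iloc[:-1].astype(float)
Xtest = df.tail(1).drop(columns='closing_price')
reg.fit(Xtrain, ytrain)
prediction = reg.predict(Xtest)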

From there you can either put the prediction into your existing dataframe with df.at (df.set_value was deprecated in pandas 0.21 and removed in 1.0) or make a new dataframe altogether if you're doing this incrementally.
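If you want to repeat this for the whole two-week horizon, one possible approach is to keep a rolling window of lag values and feed each prediction back in as the next row's t-1. The sketch below is only an illustration under assumptions that are not in the question: the helper name recursive_forecast, the toy price series, the choice of LinearRegression, and 5 lags ordered t-1, ..., t-5 are all hypothetical stand-ins for your own model and data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def recursive_forecast(model, last_lags, n_steps):
    """Predict n_steps ahead, feeding each prediction back in as the new t-1.

    last_lags holds the most recent known values ordered [t-1, t-2, ..., t-n],
    matching the column order the model was trained on.
    (Hypothetical helper, not part of pandas or scikit-learn.)
    """
    lags = list(last_lags)
    predictions = []
    for _ in range(n_steps):
        pred = float(model.predict(np.array([lags]))[0])
        predictions.append(pred)
        # Shift the window: the prediction becomes t-1, the oldest lag drops off.
        lags = [pred] + lags[:-1]
    return predictions


# Toy data (assumption): a price that rises by 1 every day.
prices = pd.Series(np.arange(1.0, 31.0), name="closing_price")
n_lags = 5
lag_cols = [f"t-{i}" for i in range(1, n_lags + 1)]

# Build the lagged frame with shift, as in the question, and drop incomplete rows.
lagged = [prices.shift(i).rename(f"t-{i}") for i in range(1, n_lags + 1)]
df = pd.concat([prices] + lagged, axis=1).dropna()

reg = LinearRegression()
reg.fit(df[lag_cols].to_numpy(), df["closing_price"].to_numpy())

# Seed the window with the last n_lags observed prices, newest first.
seed = prices.iloc[-n_lags:].to_numpy()[::-1]
forecast = recursive_forecast(reg, seed, n_steps=10)
```

Only the seed window comes from observed data; every later step uses predicted values, so errors can compound over the horizon. The loop is still needed because each row depends on the previous prediction, which a single vectorized shift cannot express.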

If I'm understanding your question correctly (please comment if I'm not!), I think you're looking for the scikit-learn Time Series Split. That will let you create multiple predictions at different points in time using only historical data.

From their documentation:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
tscv = TimeSeriesSplit(n_splits=3)
print(tscv)

# Each split trains only on rows that come before the test rows.
for train_index, test_index in tscv.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Hi Seth, This is very nearly what I want. However, I want to include data after the month I'm predicting on as well for training purposes... — anon_swe, Mar 08 '18 at 20:59
Hm. That seems a bit problematic from the point of view of label leakage: wouldn't you be using future information to help predict what happened in the past? If there's a reason to not worry about it, you could also try the regular [cross validation package](http://scikit-learn.org/stable/modules/cross_validation.html) — Seth Rothschild, Mar 08 '18 at 21:04
