
I am working on a project to deepen my understanding of LSTM networks. I am following the steps outlined in this blog post here. My dataset looks like the following:

    Date        Open        High        Low         Close       Volume
    2014-04-21  197.080002  206.199997  194.000000  204.380005  5258200
    2014-04-22  206.360001  219.330002  205.009995  218.639999  9804700
    2014-04-23  216.330002  216.740005  207.000000  207.990005  7295600
    2014-04-24  210.809998  212.800003  203.199997  207.860001  5495200
    2014-04-25  202.000000  206.699997  197.649994  199.850006  6996700

As you can see, this is a small snapshot of TSLA stock price movement.

I understand that with LSTM, this data needs to be reshaped into three dimensions:

  1. Batch Size

  2. Time Steps

  3. Features

My initial idea was to use a medium batch size (to allow for good generalization), to look back at 10 days of history as the time steps, and to use Open, High, Low, Close, and Volume as the features.
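To make the shape concrete, here is roughly what I am picturing (just placeholder zeros to illustrate the dimensions, not real windowed data):

    import numpy as np

    batch_size = 20   # a "medium" batch size
    time_steps = 10   # days of history per sample
    features = 5      # Open, High, Low, Close, Volume

    # Shape of one batch fed to the LSTM: (batch size, time steps, features)
    example_batch = np.zeros((batch_size, time_steps, features))
    print(example_batch.shape)  # (20, 10, 5)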

Here is where I am a bit stuck. I have two questions specifically:

  1. What is the approach for breaking the data into the new representation (transforming it)?

  2. How do we take this and split it into the train, test, and validation sets? I am having trouble conceptualizing exactly what is being broken down. My initial thought was to use sklearn:

    train_test_split()

But this does not seem like it will work in this case.

Obviously, once the data has been transformed and split, building the Keras model is easy; it is just a matter of calling model.fit() on the prepared data.

Any suggestions or resources (pointing in the right direction) would be greatly appreciated.

My current code is:

from sklearn.model_selection import train_test_split 

# Split the Data into Training and Testing Data
tsla_train, tsla_test = train_test_split(tsla)

tsla_train.shape
tsla_test.shape

from sklearn.preprocessing import MinMaxScaler

# Scale the Data
scaler = MinMaxScaler()

scaler.fit(tsla_train)

tsla_train_scaled = scaler.transform(tsla_train)
tsla_test_scaled = scaler.transform(tsla_test)

# Define the parameters of the model

batch_size = 20

# Set the model to look back on ten days of historical data and
# try to predict the eleventh
time_steps = 10

from keras.models import Sequential
from keras.layers import LSTM, Dense

lstm_model = Sequential()

There is some explanation found in this post here.

QFII
1 Answer


The train_test_split function would indeed not give the desired results here. It assumes that each row is an independent data point, which is not the case since you're using a single time series.

The most common option is to use earlier data points for training and later data points for testing (and a range of points in the middle for validation, if applicable). This mirrors how the model would be used in practice: you train on everything available up to some date and then make predictions for the days that follow.
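For example (a minimal sketch, assuming `tsla` is the DataFrame shown in the question, sorted by date; the 70/15/15 proportions are arbitrary):

    from sklearn.preprocessing import MinMaxScaler

    # Chronological split: earliest rows for training, latest for testing
    n = len(tsla)
    train_end = int(n * 0.70)
    val_end = int(n * 0.85)

    tsla_train = tsla.iloc[:train_end]
    tsla_val = tsla.iloc[train_end:val_end]
    tsla_test = tsla.iloc[val_end:]

    # Fit the scaler on the training period only, then apply it everywhere
    scaler = MinMaxScaler()
    tsla_train_scaled = scaler.fit_transform(tsla_train)
    tsla_val_scaled = scaler.transform(tsla_val)
    tsla_test_scaled = scaler.transform(tsla_test)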

Once the data sets are split, each training batch needs the inputs and corresponding outputs for a randomly selected set of date ranges, where each input is the chosen number of days of historical data (i.e. time steps × features, so the full batch is batch size × time steps × features) and the corresponding output is just the data for the following day.
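A rough sketch of that windowing step, assuming the scaled arrays from the split above and treating the Close column (index 3) as the prediction target:

    import numpy as np

    def make_windows(data, time_steps, target_col=3):
        """Slice a 2-D array of shape (days, features) into
        (samples, time_steps, features) inputs and next-day targets."""
        X, y = [], []
        for i in range(len(data) - time_steps):
            X.append(data[i:i + time_steps])            # time_steps days of history
            y.append(data[i + time_steps, target_col])  # the following day's Close
        return np.array(X), np.array(y)

    X_train, y_train = make_windows(tsla_train_scaled, time_steps=10)
    X_val, y_val = make_windows(tsla_val_scaled, time_steps=10)
    X_test, y_test = make_windows(tsla_test_scaled, time_steps=10)
    print(X_train.shape)  # (samples, 10, 5)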

Hopefully that helps with some of the intuition behind the procedure. The article you linked has examples of most of the code you would need. It's going to be pretty dense, but I would recommend going line by line and understanding everything it does, possibly even typing it out verbatim.
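As a very rough outline of where that ends up (a minimal sketch assuming the windowed arrays from above; the layer size, epoch count, and batch size are arbitrary choices, not recommendations):

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    lstm_model = Sequential()
    # input_shape is (time steps, features); the batch size is given to fit()
    lstm_model.add(LSTM(32, input_shape=(10, 5)))
    lstm_model.add(Dense(1))  # predicts the next day's scaled Close
    lstm_model.compile(loss='mse', optimizer='adam')

    lstm_model.fit(X_train, y_train,
                   batch_size=20,
                   epochs=50,
                   validation_data=(X_val, y_val))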

lehiester