I am trying to understand how to correctly feed data into my Keras model to classify multivariate time series data into three classes using an LSTM neural network.
I have already looked at different resources - mainly these three excellent blog posts by Jason Brownlee (post1, post2, post3), other SO questions and various papers - but none of the information given there exactly fits my problem, and I was not able to figure out whether my data preprocessing and the way I feed it into the model are correct, so I guessed I might get some help if I specify my exact conditions here.
What I am trying to do is classify multivariate time series data, which in its original form is structured as follows:
I have 200 samples
One sample is one csv file.
A sample can have 1 to 50 features (i.e. the csv file has 1 to 50 columns).
Each feature has its value "tracked" over a fixed number of time steps, let's say 100 (i.e. each csv file has exactly 100 rows).
Each csv file is labeled with one of three classes ("good", "too small", "too big").
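For context, this is roughly how I load the csv files into memory (a simplified sketch; the directory layout, file pattern and the use of pandas are just placeholders for illustration):

import glob
import numpy as np
import pandas as pd

samples = []
# one csv file per sample; the path pattern is a placeholder
for csv_path in sorted(glob.glob("data/sample_*.csv")):
    df = pd.read_csv(csv_path)        # 100 rows, 1 to 50 columns
    # transpose so that each sample becomes a list of features,
    # each feature holding its 100 time step values
    samples.append(df.values.T.tolist())

# with varying column counts this presumably ends up as an object array
samples = np.array(samples)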
So what my current status looks like is the following:
I have a numpy array "samples" with the following structure:
# array holding all samples
[
# sample 1
[
# feature 1 of sample 1
[ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
# feature 2 of sample 1
[ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
... # up to 50 features
],
# sample 2
[
# feature 1 of sample 2
[ 0.1, 0.2, 0.3, 0.2, 0.3, 0.1, 0.2, 0.4, 0.5, 0.1, ... ], # "time series" of feature 1
# feature 2 of sample 2
[ 0.5, 0.6, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1, -0.1, -0.2, ... ], # "time series" of feature 2
... # up to 50 features
],
... # up to sample no. 200
]
I also have a numpy array "labels" with the same length as the "samples" array (i.e. 200). The labels are encoded in the following way:
- "good" = 0
- "too small" = 1
- "too big" = 2
[0, 2, 2, 1, 0, 1, 2, 0, 0, 0, 1, 2, ... ] # up to label no. 200
This "labels" array is then encoded with keras' to_categorical
function
to_categorical(labels, len(np.unique(labels)))
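Written out in full, that encoding step is (the result should then have shape (200, 3), i.e. one one-hot row per sample):

import numpy as np
from keras.utils import to_categorical

# labels is the integer-encoded array of length 200 shown above
labels = to_categorical(labels, len(np.unique(labels)))
print(labels.shape)  # expecting (200, 3)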
My model definition currently looks like this:
from keras.models import Sequential
from keras.layers import LSTM, Dense

max_nb_features = 50
nb_time_steps = 100

model = Sequential()
model.add(LSTM(5, input_shape=(max_nb_features, nb_time_steps)))
model.add(Dense(3, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
- The 5 units in the LSTM layer are just picked arbitrarily for now
- The 3 output neurons in the dense layer are for my three classes
I then split the data into training and test sets:
from sklearn.model_selection import train_test_split

samples_train, samples_test, labels_train, labels_test = train_test_split(samples, labels, test_size=0.33)
This leaves us with 134 samples for training and 66 samples for testing.
The problem I'm currently running into is that the following call does not work:
model.fit(samples_train, labels_train, epochs=1, batch_size=1)
The error is the following:
Traceback (most recent call last):
File "lstm_test.py", line 152, in <module>
model.fit(samples_train, labels_train, epochs=1, batch_size=1)
File "C:\Program Files\Python36\lib\site-packages\keras\models.py", line 1002, in fit
validation_steps=validation_steps)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1630, in fit
batch_size=batch_size)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 1476, in _standardize_user_data
exception_prefix='input')
File "C:\Program Files\Python36\lib\site-packages\keras\engine\training.py", line 113, in _standardize_input_data
'with shape ' + str(data_shape))
ValueError: Error when checking input: expected lstm_1_input to have 3 dimensions, but got array with shape (134, 1)
To me, it seems to fail because of the variable number of features my samples can have. If I use "fake" (generated) data in which everything else is the same but every sample has exactly the same number of features (50), the code works.
Now what I'm trying to understand is:
- Are my general assumptions on how I structured my data for the LSTM input correct? Are the parameters (batch_size, input_shape) correct / sensible?
- Is the Keras LSTM model in general able to handle samples with a different number of features?
- If yes, how do I have to adapt my code for it to work with a different number of features?
- If no, would "zero-padding" (filling) the columns of the samples with fewer than 50 features work (roughly what I have in mind is sketched below)? Are there other, preferred methods of achieving my goal?