I am having a hard time understanding the inner workings of LSTMs in PyTorch.
Let me show you a toy example. Maybe the architecture does not make much sense, but I am trying to understand how LSTM works in this context.
The data can be obtained from here. Each row i (1152 rows in total) is a slice of a longer time series, running from t = i to t = i + 91. I will extract the last column of each row to use as the label.
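For context, this is roughly how I understand the rows to be constructed; the series below is just a stand-in for the real one (the actual data comes from the CSV), but the windowing should match the description above:
import numpy as np
series = np.sin(np.linspace(0, 50, 1152 + 90))              # hypothetical stand-in for the real series
rows = np.stack([series[i:i + 91] for i in range(1152)])    # row i = series[i : i + 91]
print(rows.shape)                                            # (1152, 91): 90 inputs + 1 label per row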
import torch
import numpy as np
import pandas as pd
from torch import nn, optim
from sklearn.metrics import mean_absolute_error
data = pd.read_csv('data.csv', header = None).values
X = torch.tensor(data[:, :90], dtype = torch.float).view(1152, 1, 90)   # first 90 columns as inputs
y = torch.tensor(data[:, 90], dtype = torch.float).view(1152, 1, 1)     # last column as labels
dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size = 50)
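For reference, each batch coming out of the loader has these shapes:
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)   # torch.Size([50, 1, 90]) torch.Size([50, 1, 1])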
Then I define an LSTM regressor containing three LSTM modules with different hidden sizes.
class regressor_LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size = 90, hidden_size = 100)
        self.lstm2 = nn.LSTM(100, 50)
        self.lstm3 = nn.LSTM(50, 50, dropout = 0.3, num_layers = 2)
        self.dropout = nn.Dropout(p = 0.3)
        self.linear = nn.Linear(in_features = 50, out_features = 1)

    def forward(self, X):
        X, _ = self.lstm1(X)
        X = self.dropout(X)
        X, _ = self.lstm2(X)
        X = self.dropout(X)
        X, _ = self.lstm3(X)
        X = self.dropout(X)
        X = self.linear(X)
        return X
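As I understand it, each nn.LSTM call returns the full output sequence plus a (h_n, c_n) tuple, which I am discarding with _. A minimal sketch of the shapes involved (toy sizes, not my real data, reusing the imports above):
lstm = nn.LSTM(input_size = 90, hidden_size = 100)   # batch_first = False by default
x = torch.randn(7, 3, 90)                            # (seq_len, batch, input_size)
out, (h_n, c_n) = lstm(x)
print(out.shape)    # torch.Size([7, 3, 100]): output at every time step
print(h_n.shape)    # torch.Size([1, 3, 100]): final hidden state, (num_layers, batch, hidden_size)
print(c_n.shape)    # torch.Size([1, 3, 100]): final cell state, same shape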
Initializing what needs to be initialized:
regressor = regressor_LSTM()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(regressor.parameters())
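As far as I can tell, these are the trainable parameters the optimizer receives (each nn.LSTM exposes weight_ih_l*, weight_hh_l*, bias_ih_l*, bias_hh_l* per layer, plus the linear layer's weight and bias); I have been inspecting them like this:
for name, p in regressor.named_parameters():
    print(name, tuple(p.shape))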
Then training:
for epoch in range(25):
    acc_loss = 0.
    acc_mae = 0.
    for i, data in enumerate(loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = regressor(inputs)
        loss = criterion(outputs, labels)
        loss.backward(retain_graph = True)
        optimizer.step()
        acc_loss += loss.item()
        mae = mean_absolute_error(labels.detach().cpu().numpy().flatten(), outputs.detach().cpu().numpy().flatten())
        acc_mae += mae
        # print('\rEPOCH {:3d} - Loop {:3d} of {:3d}: loss {:03.2f} - MAE {:03.2f}'.format(epoch+1, i+1, len(loader), loss, mae), end = '\r')
    print('\nEPOCH %3d FINISHED: loss %.5f - MAE %.5f' % (epoch+1, acc_loss/len(loader), acc_mae/len(loader)))
The thing is, after some initial decrease in both loss and MAE (expected behavior), both seem to get stuck (showing only first 10 epochs below):
EPOCH 1 FINISHED: loss 0.38506 - MAE 0.27322
EPOCH 2 FINISHED: loss 0.02825 - MAE 0.13601
EPOCH 3 FINISHED: loss 0.02593 - MAE 0.13117
EPOCH 4 FINISHED: loss 0.02568 - MAE 0.12705
EPOCH 5 FINISHED: loss 0.02546 - MAE 0.12920
EPOCH 6 FINISHED: loss 0.02502 - MAE 0.12763
EPOCH 7 FINISHED: loss 0.02445 - MAE 0.12659
EPOCH 8 FINISHED: loss 0.02310 - MAE 0.12328
EPOCH 9 FINISHED: loss 0.02277 - MAE 0.12237
EPOCH 10 FINISHED: loss 0.02352 - MAE 0.12476
When run with Keras, both metrics decrease consistently throughout the process. (I also noticed Keras takes much longer.)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
import pandas as pd
data = pd.read_csv('data.csv', header = None).values
X = data[:, :90].reshape(1152, 90, 1)
y = data[:, 90]
regressor = Sequential()
regressor.add(LSTM(units = 100, return_sequences = True, input_shape = (90, 1)))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.3))
regressor.add(Dense(units = 1, activation = 'linear'))
regressor.compile(optimizer = 'rmsprop', loss = 'mean_squared_error', metrics = ['mean_absolute_error'])
regressor.fit(X, y, epochs = 25, batch_size = 32)
[OUTPUT]
Epoch 1/25
1152/1152 - 35s 30ms/sample - loss: 0.0307 - mean_absolute_error: 0.1225
Epoch 2/25
1152/1152 - 32s 28ms/sample - loss: 0.0156 - mean_absolute_error: 0.0978
Epoch 3/25
1152/1152 - 32s 28ms/sample - loss: 0.0126 - mean_absolute_error: 0.0871
Epoch 4/25
1152/1152 - 34s 30ms/sample - loss: 0.0111 - mean_absolute_error: 0.0806
Epoch 5/25
1152/1152 - 29s 25ms/sample - loss: 0.0103 - mean_absolute_error: 0.0785
Epoch 6/25
1152/1152 - 29s 25ms/sample - loss: 0.0088 - mean_absolute_error: 0.0718
Epoch 7/25
1152/1152 - 32s 27ms/sample - loss: 0.0085 - mean_absolute_error: 0.0699
Epoch 8/25
1152/1152 - 30s 26ms/sample - loss: 0.0069 - mean_absolute_error: 0.0640
Epoch 9/25
1152/1152 - 30s 26ms/sample - loss: 0.0077 - mean_absolute_error: 0.0660
Epoch 10/25
1152/1152 - 30s 26ms/sample - loss: 0.0070 - mean_absolute_error: 0.0644
I've been reading about hidden state initialization and tried setting the hidden and cell states to zero at the beginning of the forward method (although, as I understand it, that is already the default behavior), but it did not help. I must confess that I do not fully understand what the parameters of an LSTM are, nor which of them (if any) should be reinitialized after each batch or epoch.
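Concretely, the zero-initialization I tried looked roughly like this (sketched only for the first layer; as far as I understand, the state tensors are shaped (num_layers, batch, hidden_size)):
    def forward(self, X):
        # X is (seq_len, batch, input_size) with the default batch_first = False
        h0 = torch.zeros(1, X.size(1), 100)   # (num_layers, batch, hidden_size)
        c0 = torch.zeros(1, X.size(1), 100)
        X, _ = self.lstm1(X, (h0, c0))        # explicitly passing zeroed initial states
        # ... rest of the forward unchanged ...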
Any feedback is appreciated!