
I am having a hard time understanding the inner workings of the LSTM in PyTorch.

Let me show you a toy example. Maybe the architecture does not make much sense, but I am trying to understand how LSTM works in this context.

The data can be obtained from here. Each row i (1152 in total) is a slice, starting from t = i until t = i + 91, of a longer time series. I will extract the last column of each row to use as the labels.
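
To be concrete, here is a hypothetical reconstruction of the layout I am describing (series is just a stand-in for the real data):

import numpy as np

# hypothetical sketch of the described layout: a sliding window of length 91
# over a longer series; 'series' stands in for the real data
series = np.random.rand(1152 + 90)
rows = np.stack([series[i:i + 91] for i in range(1152)])  # shape (1152, 91)
features, labels = rows[:, :90], rows[:, 90]  # last column is the label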

import torch
import numpy as np
import pandas as pd
from torch import nn, optim
from sklearn.metrics import mean_absolute_error

data = pd.read_csv('data.csv', header = None).values
X = torch.tensor(data[:, :90], dtype = torch.float).view(1152, 1, 90)
y = torch.tensor(data[:, 90], dtype = torch.float).view(1152, 1, 1)

dataset = torch.utils.data.TensorDataset(X, y)
loader = torch.utils.data.DataLoader(dataset, batch_size = 50)
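
For completeness, this is what one batch looks like. Note that nn.LSTM defaults to batch_first = False, so it will read the first dimension as time:

inputs, labels = next(iter(loader))
print(inputs.shape)  # torch.Size([50, 1, 90])
# with batch_first = False (the default), nn.LSTM reads this as
# (seq_len = 50, batch = 1, input_size = 90)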

Then I define an LSTM regressor containing three LSTM layers with different structures.

class regressor_LSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size = 90, hidden_size = 100)  # input_size must match the 90 features per time step
        self.lstm2 = nn.LSTM(100, 50)
        self.lstm3 = nn.LSTM(50, 50, dropout = 0.3, num_layers = 2)
        self.dropout = nn.Dropout(p = 0.3)
        self.linear = nn.Linear(in_features = 50, out_features = 1)

    def forward(self, X):
        X, _ = self.lstm1(X)
        X = self.dropout(X)
        X, _ = self.lstm2(X)
        X = self.dropout(X)
        X, _ = self.lstm3(X)
        X = self.dropout(X)
        X = self.linear(X)

        return X
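
A quick sanity check of the shapes flowing through the model:

model = regressor_LSTM()
out = model(torch.randn(50, 1, 90))
print(out.shape)  # torch.Size([50, 1, 1]), matching the labels in each batch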

Initializing what needs to be initialized:

regressor = regressor_LSTM()
criterion = nn.MSELoss()
optimizer = optim.RMSprop(regressor.parameters())

Then training:

for epoch in range(25):
    acc_loss = 0.
    acc_mae = 0.   
    for i, data in enumerate(loader):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = regressor(inputs)
        loss = criterion(outputs, labels)
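        # note: retain_graph = True is not strictly needed here, since each
        # batch builds a fresh graph; a plain loss.backward() also works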
        loss.backward(retain_graph = True)
        optimizer.step()
        acc_loss += loss.item()
        mae = mean_absolute_error(labels.detach().cpu().numpy().flatten(), outputs.detach().cpu().numpy().flatten())
        acc_mae += mae
#       print('\rEPOCH {:3d} - Loop {:3d} of {:3d}: loss {:03.2f} - MAE {:03.2f}'.format(epoch+1, i+1, len(loader), loss, mae), end = '\r')
    print('\nEPOCH %3d FINISHED: loss %.5f - MAE %.5f' % (epoch+1, acc_loss/len(loader), acc_mae/len(loader)))

The thing is, after an initial decrease in both loss and MAE (the expected behavior), both seem to get stuck (only the first 10 epochs are shown below):


EPOCH   1 FINISHED: loss 0.38506 - MAE 0.27322          
EPOCH   2 FINISHED: loss 0.02825 - MAE 0.13601          
EPOCH   3 FINISHED: loss 0.02593 - MAE 0.13117          
EPOCH   4 FINISHED: loss 0.02568 - MAE 0.12705          
EPOCH   5 FINISHED: loss 0.02546 - MAE 0.12920          
EPOCH   6 FINISHED: loss 0.02502 - MAE 0.12763          
EPOCH   7 FINISHED: loss 0.02445 - MAE 0.12659          
EPOCH   8 FINISHED: loss 0.02310 - MAE 0.12328          
EPOCH   9 FINISHED: loss 0.02277 - MAE 0.12237          
EPOCH  10 FINISHED: loss 0.02352 - MAE 0.12476

When run with Keras, both metrics decrease consistently throughout training. (I also noticed that Keras takes much longer.)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
import pandas as pd

data = pd.read_csv('data.csv', header = None).values
X = data[:, :90].reshape(1152, 90, 1)
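# Keras will interpret this shape as (batch = 1152, timesteps = 90, features = 1)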
y = data[:, 90]

regressor = Sequential()
regressor.add(LSTM(units = 100, return_sequences = True, input_shape = (90, 1)))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.3))
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.3))
regressor.add(Dense(units = 1, activation = 'linear'))
regressor.compile(optimizer = 'rmsprop', loss = 'mean_squared_error', metrics = ['mean_absolute_error'])
regressor.fit(X, y, epochs = 25, batch_size = 32)

The output:

Epoch 1/25
1152/1152 - 35s 30ms/sample - loss: 0.0307 - mean_absolute_error: 0.1225
Epoch 2/25
1152/1152 - 32s 28ms/sample - loss: 0.0156 - mean_absolute_error: 0.0978
Epoch 3/25
1152/1152 - 32s 28ms/sample - loss: 0.0126 - mean_absolute_error: 0.0871
Epoch 4/25
1152/1152 - 34s 30ms/sample - loss: 0.0111 - mean_absolute_error: 0.0806
Epoch 5/25
1152/1152 - 29s 25ms/sample - loss: 0.0103 - mean_absolute_error: 0.0785
Epoch 6/25
1152/1152 - 29s 25ms/sample - loss: 0.0088 - mean_absolute_error: 0.0718
Epoch 7/25
1152/1152 - 32s 27ms/sample - loss: 0.0085 - mean_absolute_error: 0.0699
Epoch 8/25
1152/1152 - 30s 26ms/sample - loss: 0.0069 - mean_absolute_error: 0.0640
Epoch 9/25
1152/1152 - 30s 26ms/sample - loss: 0.0077 - mean_absolute_error: 0.0660
Epoch 10/25
1152/1152 - 30s 26ms/sample - loss: 0.0070 - mean_absolute_error: 0.0644

I have been reading about hidden state initialization, and I tried setting the states to zero at the beginning of the forward method (which, as far as I understand, is the default behavior anyway), but nothing helped. I must confess that I do not understand what the hidden/cell states of an LSTM really are, nor which of them should be reinitialized (if any) after each batch or epoch.
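
For reference, this is the kind of explicit initialization I tried; as far as I understand, it is equivalent to the default (state shapes are (num_layers, batch, hidden_size)):

    def forward(self, X):
        # hypothetical variant: pass explicit zero states to the first LSTM;
        # with batch_first = False (the default), dim 1 of X is the batch
        h0 = torch.zeros(1, X.size(1), 100)
        c0 = torch.zeros(1, X.size(1), 100)
        X, _ = self.lstm1(X, (h0, c0))
        # ... remaining layers as before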

I appreciate any feedback!

Denny Ceccon
  • I think this is a matter of understanding how hidden states should be moved across the several LSTM layers. I suggest you look at [my long answer on the proper way of retrieving hidden states in PyTorch](https://stackoverflow.com/a/56683970/7347631). – ndrwnaguib Oct 05 '19 at 22:49
  • Amazing explanation, Andrew; I just keep wondering how to structure the whole thing. 1) When I call the forward method and pass it a batch, does that mean that all timesteps go through the model, with h and c being updated across time accordingly? 2) Then, should I pass h and c from my previous LSTM layers to the next ones? But how do I do that if my LSTM layers have different sizes? 3) And should I reset h and c when forward is called again? (If so, doesn't the module do that for me automatically?) – Denny Ceccon Oct 06 '19 at 22:01

1 Answer


I am coming back after a few days because I have reached a conclusion. After reading some material on hidden/cell states (this one was quite useful), it seems that reusing them is a matter of network design: whether to do so, and when, can be treated as a hyperparameter. I tried many options with my toy dataset, mainly resetting the states after each batch, resetting after each epoch, and not resetting at all, and the results were quite similar. Also, my results were as poor as shown because (I believe) I did not set shuffle = True in the loader; doing so improved them considerably (loss around 0.003, MAE around 0.047).
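
For reference, the only change needed in the loader was the shuffle flag:

loader = torch.utils.data.DataLoader(dataset, batch_size = 50, shuffle = True)

And here is a minimal sketch of the "not resetting at all" variant with a bare nn.LSTM, detaching the states so that the next backward pass does not reach into the previous batch's graph:

lstm = nn.LSTM(input_size = 90, hidden_size = 100)
state = None  # None means "start from zeros", which is PyTorch's default

for inputs, labels in loader:
    output, state = lstm(inputs, state)
    state = tuple(s.detach() for s in state)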

In the original code for the LSTM class (line 510), it also seems that the hidden/cell states are initialized to zero if no values are explicitly passed.
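
This can be checked directly; omitting the states gives the same output as passing zeros explicitly:

lstm = nn.LSTM(input_size = 90, hidden_size = 100)
x = torch.randn(5, 1, 90)
zeros = torch.zeros(1, 1, 100)  # (num_layers, batch, hidden_size)
out_default, _ = lstm(x)
out_explicit, _ = lstm(x, (zeros, zeros))
print(torch.allclose(out_default, out_explicit))  # True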

Denny Ceccon