
I am converting Keras code into PyTorch because I am more familiar with the latter than the former. However, I found that it is not learning (or only barely).

Below I have provided almost all of my PyTorch code, including the initialisation code so that you can try it out yourself. The only thing you would need to provide yourself is the word embeddings (I'm sure you can find many word2vec models online). The first input file should be a file with tokenised text; the second input file should be a file with floating-point numbers, one per line. Because I have provided all the code, this question may seem huge and too broad. However, I think my question is specific enough: what is wrong in my model or training loop that causes my model to barely improve, if at all? (See below for results.)

I have tried to provide many comments where applicable, and I have provided the shape transformations as well so you do not have to run the code to see what is going on. The data prep methods are not important to inspect.

The most important parts are the forward method of RegressorNet, and the training loop of RegressionRNN (admittedly, these names were badly chosen). I think the mistake is somewhere in there.

from pathlib import Path
import time

import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
import gensim

from scipy.stats import pearsonr

from LazyTextDataset import LazyTextDataset


class RegressorNet(nn.Module):
    def __init__(self, hidden_dim, embeddings=None, drop_prob=0.0):
        super(RegressorNet, self).__init__()
        self.hidden_dim = hidden_dim
        self.drop_prob = drop_prob

        # Load pretrained w2v model, but freeze it: don't retrain it.
        self.word_embeddings = nn.Embedding.from_pretrained(embeddings)
        self.word_embeddings.weight.requires_grad = False
        self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)

        self.dropout = nn.Dropout(drop_prob)
        self.linear = nn.Linear(hidden_dim * 2, 1)
        # LeakyReLU rather than ReLU so that we don't get stuck with dead nodes
        self.lrelu = nn.LeakyReLU()

    def forward(self, batch_size, sentence_input):
        # shape sizes for:
        # * batch_size 128
        # * embeddings of dim 146
        # * hidden dim of 200
        # * sentence length of 20

        # sentence_input: torch.Size([128, 20])
        # Get word2vec vector representation
        embeds = self.word_embeddings(sentence_input)
        # embeds: torch.Size([128, 20, 146])

        # embeds.view(-1, batch_size, embeds.size(2)): torch.Size([20, 128, 146])
        # Input vectors into GRU, only keep track of output
        w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
        # w2v_out = torch.Size([20, 128, 400])

        # Leaky ReLU it
        w2v_out = self.lrelu(w2v_out)

        # Dropout some nodes
        if self.drop_prob > 0:
            w2v_out = self.dropout(w2v_out)
        # w2v_out: torch.Size([20, 128, 400])

        # w2v_out[-1, :, :]: torch.Size([128, 400])
        # Only use the last output of a sequence! Supposedly that cell outputs the final information
        regression = self.linear(w2v_out[-1, :, :])
        # regression: torch.Size([128, 1])

        return regression


class RegressionRNN:
    def __init__(self, train_files=None, test_files=None, dev_files=None):
        print('Using torch ' + torch.__version__)

        self.datasets, self.dataloaders = RegressionRNN._set_data_loaders(train_files, test_files, dev_files)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None

    @staticmethod
    def _set_data_loaders(train_files, test_files, dev_files):
        # labels must be the last input file
        datasets = {
            'train': LazyTextDataset(train_files) if train_files is not None else None,
            'test': LazyTextDataset(test_files) if test_files is not None else None,
            'valid': LazyTextDataset(dev_files) if dev_files is not None else None
        }
        dataloaders = {
            'train': DataLoader(datasets['train'], batch_size=128, shuffle=True, num_workers=4) if train_files is not None else None,
            'test': DataLoader(datasets['test'], batch_size=128, num_workers=4) if test_files is not None else None,
            'valid': DataLoader(datasets['valid'], batch_size=128, num_workers=4) if dev_files is not None else None
        }

        return datasets, dataloaders

    @staticmethod
    def prepare_lines(data, split_on=None, cast_to=None, min_size=None, pad_str=None, max_size=None, to_numpy=False,
                      list_internal=False):
        """ Converts the string input (line) to an applicable format. """
        out = []
        for line in data:
            line = line.strip()
            if split_on:
                line = line.split(split_on)
                line = list(filter(None, line))
            else:
                line = [line]

            if cast_to is not None:
                line = [cast_to(l) for l in line]

            if min_size is not None and len(line) < min_size:
                # pad line up to a number of tokens
                line += (min_size - len(line)) * ['@pad@']
            elif max_size and len(line) > max_size:
                line = line[:max_size]

            if list_internal:
                line = [[item] for item in line]

            if to_numpy:
                line = np.array(line)

            out.append(line)

        if to_numpy:
            out = np.array(out)

        return out

    def prepare_w2v(self, data):
        idxs = []
        for seq in data:
            tok_idxs = []
            for word in seq:
                # For every word, get its index in the w2v model.
                # If it doesn't exist, use @unk@ (available in the model).
                try:
                    tok_idxs.append(self.w2v_vocab[word].index)
                except KeyError:
                    tok_idxs.append(self.w2v_vocab['@unk@'].index)
            idxs.append(tok_idxs)
        idxs = torch.tensor(idxs, dtype=torch.long)

        return idxs

    def train(self, epochs=10):
        valid_loss_min = np.Inf
        train_losses, valid_losses = [], []
        for epoch in range(1, epochs + 1):
            epoch_start = time.time()

            train_loss, train_results = self._train_valid('train')
            valid_loss, valid_results = self._train_valid('valid')

            # Calculate Pearson correlation between prediction and target
            try:
                train_pearson = pearsonr(train_results['predictions'], train_results['targets'])
            except FloatingPointError:
                train_pearson = "Could not calculate Pearsonr"

            try:
                valid_pearson = pearsonr(valid_results['predictions'], valid_results['targets'])
            except FloatingPointError:
                valid_pearson = "Could not calculate Pearsonr"

            # calculate average losses
            train_loss = np.mean(train_loss)
            valid_loss = np.mean(valid_loss)

            train_losses.append(train_loss)
            valid_losses.append(valid_loss)

            # print training/validation statistics
            print(f'----------\n'
                  f'Epoch {epoch} - completed in {(time.time() - epoch_start):.0f} seconds\n'
                  f'Training Loss: {train_loss:.6f}\t Pearson: {train_pearson}\n'
                  f'Validation loss: {valid_loss:.6f}\t Pearson: {valid_pearson}')

            # validation loss has decreased
            if valid_loss <= valid_loss_min and train_loss > valid_loss:
                print(f'!! Validation loss decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}).  Saving model ...')
                valid_loss_min = valid_loss

            if train_loss <= valid_loss:
                print('!! Training loss is lte validation loss. Might be overfitting!')

            # Optimise with scheduler
            if self.scheduler is not None:
                self.scheduler.step(valid_loss)

        print('Done training...')

    def _train_valid(self, do):
        """ Do training or validating. """
        if do not in ('train', 'valid'):
            raise ValueError("Use 'train' or 'valid' for 'do'.")

        results = {'predictions': np.array([]), 'targets': np.array([])}
        losses = np.array([])

        self.model = self.model.to(self.device)
        if do == 'train':
            self.model.train()
            torch.set_grad_enabled(True)
        else:
            self.model.eval()
            torch.set_grad_enabled(False)

        for batch_idx, data in enumerate(self.dataloaders[do], 1):
            # 1. Data prep
            sentence = data[0]
            target = data[-1]
            curr_batch_size = target.size(0)

            # Returns list of tokens, possibly padded @pad@
            sentence = self.prepare_lines(sentence, split_on=' ', min_size=20, max_size=20)
            # Converts tokens into w2v IDs as a Tensor
            sent_w2v_idxs = self.prepare_w2v(sentence)
            # Converts output to Tensor of floats
            target = torch.Tensor(self.prepare_lines(target, cast_to=float))

            # Move input to device
            sent_w2v_idxs, target = sent_w2v_idxs.to(self.device), target.to(self.device)

            # 2. Predictions
            pred = self.model(curr_batch_size, sentence_input=sent_w2v_idxs)
            loss = self.criterion(pred, target)

            # 3. Optimise during training
            if do == 'train':
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

            # 4. Save results
            pred = pred.detach().cpu().numpy()
            target = target.cpu().numpy()

            results['predictions'] = np.append(results['predictions'], pred, axis=None)
            results['targets'] = np.append(results['targets'], target, axis=None)
            losses = np.append(losses, float(loss))

        torch.set_grad_enabled(True)

        return losses, results


if __name__ == '__main__':
    HIDDEN_DIM = 200

    # Load embeddings from pretrained gensim model
    embed_p = Path('path-to.w2v_model').resolve()
    w2v_model = gensim.models.KeyedVectors.load_word2vec_format(str(embed_p))
    # add a padding token with only zeros
    w2v_model.add(['@pad@'], [np.zeros(w2v_model.vectors.shape[1])])
    embed_weights = torch.FloatTensor(w2v_model.vectors)


    # Text files are used as input. Every line is one datapoint.
    # *.tok.low.*: tokenized (space-separated) sentences
    # *.cross: one floating point number per line, which we are trying to predict
    regr = RegressionRNN(train_files=(r'train.tok.low.en',
                                      r'train.cross'),
                         dev_files=(r'dev.tok.low.en',
                                    r'dev.cross'),
                         test_files=(r'test.tok.low.en',
                                     r'test.cross'))
    regr.w2v_vocab = w2v_model.vocab
    regr.model = RegressorNet(HIDDEN_DIM, embed_weights, drop_prob=0.2)
    regr.criterion = nn.MSELoss()
    regr.optimizer = optim.Adam(list(regr.model.parameters())[0:], lr=0.001)
    regr.scheduler = optim.lr_scheduler.ReduceLROnPlateau(regr.optimizer, 'min', factor=0.1, patience=5, verbose=True)

    regr.train(epochs=100)

For the LazyTextDataset, you can refer to the class below.

from torch.utils.data import Dataset

import linecache


class LazyTextDataset(Dataset):
    def __init__(self, paths):
        # labels are in the last path
        self.paths, self.labels_path = paths[:-1], paths[-1]

        with open(self.labels_path, encoding='utf-8') as fhin:
            lines = 0
            for line in fhin:
                if line.strip() != '':
                    lines += 1

            self.num_entries = lines

    def __getitem__(self, idx):
        data = [linecache.getline(p, idx + 1) for p in self.paths]
        label = linecache.getline(self.labels_path, idx + 1)

        return (*data, label)

    def __len__(self):
        return self.num_entries

As I wrote before, I am trying to convert a Keras model to PyTorch. The original Keras code does not use an embedding layer; it uses pre-built word2vec vectors per sentence as input, so there is no embedding layer in the summary below. The Keras summary looks like this (I don't have access to the base model setup).


Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
bidirectional_1 (Bidirectional)  (200, 400)            417600
____________________________________________________________________________________________________
dropout_1 (Dropout)              (200, 800)            0           merge_1[0][0]
____________________________________________________________________________________________________
dense_1 (Dense)                  (200, 1)              801         dropout_1[0][0]
====================================================================================================

The issue is that with identical input, the Keras model works and gets a +0.5 Pearson correlation between predicted and actual labels. The PyTorch model above, though, does not seem to work at all. To give you an idea, here is the loss (mean squared error) and Pearson (correlation coefficient, p-value) after the first epoch:

Epoch 1 - completed in 11 seconds
Training Loss: 1.684495  Pearson: (-0.0006077809280690612, 0.8173368901481127)
Validation loss: 1.708228    Pearson: (0.017794288315261794, 0.4264098054188664)

And after the 100th epoch:

Epoch 100 - completed in 11 seconds
Training Loss: 1.660194  Pearson: (0.0020315421756790806, 0.4400929436716754)
Validation loss: 1.704910    Pearson: (-0.017288118524826892, 0.4396865964324158)

The loss is plotted below (when you look at the Y-axis, you can see the improvements are minimal).

[loss plot]

A final indicator that something may be wrong is that for my 140K lines of input, each epoch only takes 10 seconds on my GTX 1080TI. I feel that this is not much, and I would guess that the optimisation is not working/running. I cannot figure out why, though. The issue will probably be in my train loop or the model itself, but I cannot find it.

Again, something must be going wrong because:

  • the Keras model does perform well;
  • the training speed seems 'too fast' for 140K sentences;
  • there are almost no improvements after training.

What am I missing? The issue is more than likely present in the training loop or in the network structure.

    People who downvote and even vote to close because it is too broad: please read the whole question, and if it is too much for you to read, go to the next question. The code is very detailed for purposes of reproducibility. If you only want to look at where the issue is, look at the network class and at the training loop. – Bram Vanroy Feb 22 '19 at 10:31
  • Great for using the term "regression" without linear. – prosti Mar 01 '19 at 12:36

1 Answer


TL;DR: Use permute instead of view when swapping axes; see the end of the answer to get an intuition about the difference.

About RegressorNet (neural network model)

  1. There is no need to freeze the embedding layer if you are using from_pretrained. As the documentation states, it is frozen by default (freeze=True) and does not receive gradient updates.

  2. This part:

    self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim, bidirectional=True, dropout=drop_prob)
    

    and especially the dropout argument: without providing num_layers it is pointless, as dropout is only applied between stacked RNN layers and therefore has no effect on a shallow, single-layer network.

  3. BUG AND MAIN ISSUE: in your forward function you are using view instead of permute, here:

    w2v_out, _ = self.w2v_rnode(embeds.view(-1, batch_size, embeds.size(2)))
    

    See this answer and appropriate documentation for each of those functions and try to use this line instead:

    w2v_out, _ = self.w2v_rnode(embeds.permute(1, 0, 2))
    

    You may consider using the batch_first=True argument when creating w2v_rnode; that way you won't have to permute the axes at all.

  4. Check the documentation of torch.nn.GRU: you are after the hidden state of the last step, not after the output at every step of the sequence, so you should use:

    _, last_hidden = self.w2v_rnode(embeds.permute(1, 0, 2))

    but I think this part is fine otherwise. A combined sketch of points 2-4 follows right after this list.
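To tie points 2-4 together, here is a minimal sketch of what the fixed model could look like. It reuses the names and dimensions from your question, but it is my own reformulation (with batch_first=True so that no axis swapping is needed), not a drop-in replacement:

import torch
from torch import nn


class FixedRegressorNet(nn.Module):
    """Sketch only: same idea as RegressorNet, with the axis handling fixed."""

    def __init__(self, hidden_dim, embeddings, drop_prob=0.0):
        super().__init__()
        # freeze=True is the default, so no need to touch requires_grad afterwards
        self.word_embeddings = nn.Embedding.from_pretrained(embeddings, freeze=True)
        # batch_first=True keeps the (batch, seq, features) layout: no permute needed.
        # Dropout is dropped here because it has no effect on a single-layer GRU.
        self.w2v_rnode = nn.GRU(embeddings.size(1), hidden_dim,
                                bidirectional=True, batch_first=True)
        self.linear = nn.Linear(hidden_dim * 2, 1)
        self.lrelu = nn.LeakyReLU()

    def forward(self, sentence_input):
        # sentence_input: (batch, seq_len) of word indices
        embeds = self.word_embeddings(sentence_input)   # (batch, seq_len, emb_dim)
        _, last_hidden = self.w2v_rnode(embeds)         # last_hidden: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states
        last_hidden = torch.cat((last_hidden[0], last_hidden[1]), dim=1)  # (batch, 2 * hidden_dim)
        return self.linear(self.lrelu(last_hidden))     # (batch, 1)


# Tiny smoke test with random stand-in embeddings
if __name__ == '__main__':
    fake_embeddings = torch.randn(1000, 146)
    net = FixedRegressorNet(hidden_dim=200, embeddings=fake_embeddings)
    batch = torch.randint(0, 1000, (128, 20))
    print(net(batch).shape)  # torch.Size([128, 1])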

Data preparation

No offence, but prepare_lines is very unreadable and seems pretty hard to maintain as well, not to mention hard to debug (I suspect part of the problem lies in here).

First of all, it seems like you are padding manually. Please don't do it that way; use torch.nn.utils.rnn.pad_sequence to work with batches!

In essence, you first encode each word in every sentence as an index pointing into the embedding (as you seem to do in prepare_w2v); after that, you use torch.nn.utils.rnn.pad_sequence and torch.nn.utils.rnn.pack_padded_sequence (or torch.nn.utils.rnn.pack_sequence if the lines are already sorted by length).
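As a toy illustration (the indices and dimensions are made up for the example), this is roughly how those utilities fit together; note that they live under torch.nn.utils.rnn:

import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# Three "sentences" already encoded as word indices, with different lengths
sentences = [torch.tensor([4, 8, 2, 7]), torch.tensor([5, 3]), torch.tensor([9, 1, 6])]
lengths = torch.tensor([len(s) for s in sentences])

# Sort by descending length (required by pack_padded_sequence in older PyTorch versions)
lengths, sorter = lengths.sort(descending=True)
sentences = [sentences[i] for i in sorter.tolist()]

# Pad into one (batch, max_len) tensor; index 0 is reserved for padding here
padded = pad_sequence(sentences, batch_first=True, padding_value=0)

# Embed, then pack so the GRU knows where each sentence really ends
embedding = nn.Embedding(10, 3, padding_idx=0)
packed = pack_padded_sequence(embedding(padded), lengths, batch_first=True)

gru = nn.GRU(input_size=3, hidden_size=5, batch_first=True)
packed_out, h_n = gru(packed)
print(h_n.shape)  # torch.Size([1, 3, 5]): one final hidden state per sentence, no padding noise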

Proper batching

This part is very important and it seems you are not doing that at all (and likely this is the second error in your implementation).

PyTorch's RNN modules can take their input not as padded tensors, but as torch.nn.utils.rnn.PackedSequence objects. This is an efficient object storing the indices which specify the unpadded length of each sequence.

See more information on the topic here, here and in many other blog posts throughout the web.

The first sequence in the batch has to be the longest, and all others have to be provided in descending order of length. That means one of the following:

  1. You have to sort your batch each time by sequence length and sort your targets analogously, OR
  2. Sort your batch, push it through the network and unsort it afterwards to match it with your targets.

Either is fine; it's your call which seems more intuitive to you. What I like to do is more or less the following, hope it helps:

  1. Create unique indices for each word and map each sentence appropriately (you've already done it).
  2. Create a regular torch.utils.data.Dataset object whose __getitem__ returns a single sentence as a tuple of features (torch.Tensor) and a label (single value); it seems like you're doing this as well.
  3. Create a custom collate_fn for use with torch.utils.data.DataLoader, which is responsible for sorting and padding each batch in this scenario (and for returning the lengths of each sentence, to be passed into the neural network).
  4. Using the sorted and padded features and their lengths, apply torch.nn.utils.rnn.pack_padded_sequence inside the neural network's forward method (do it after embedding!) to push the batch through the RNN layer.
  5. Depending on the use-case, unpack it using torch.nn.utils.rnn.pad_packed_sequence. In your case, you only care about the last hidden state, hence you don't have to do that. If you were using all of the hidden outputs (as is the case with, say, attention networks), you would add this part.

When it comes to the third point, here is a sample implementation of collate_fn; you should get the idea:

import torch


def length_sort(features):
    # Get the length of each sentence in the batch
    sentences_lengths = torch.tensor(list(map(len, features)))
    # Get the indices which sort the sentences by descending length
    sorted_lengths, sorter = sentences_lengths.sort(descending=True)
    # Pad the batch; the caller applies `sorter` to features, labels and lengths alike
    padded_features = torch.nn.utils.rnn.pad_sequence(features, batch_first=True)
    return padded_features, sorted_lengths, sorter


def pad_collate_fn(batch):
    # DataLoader hands us a list of (features, label) tuples; split them apart
    features, labels = (
        [element[0] for element in batch],
        [element[1] for element in batch],
    )
    padded_features, sorted_lengths, sorter = length_sort(features)
    # Sort features and labels by descending sentence length
    sorted_padded_features, sorted_labels = (
        padded_features[sorter],
        torch.tensor(labels)[sorter],
    )
    # Return the *sorted* lengths so they line up with the sorted features
    return sorted_padded_features, sorted_labels, sorted_lengths

Use this as the collate_fn in your DataLoaders and you should be just about fine (maybe with minor adjustments, so it's essential that you understand the idea behind it).
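To make the wiring concrete, here is a hypothetical end-to-end usage sketch. The toy Dataset and all dimensions are invented (we don't have your data), and pad_collate_fn is the function defined above:

import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence
from torch.utils.data import Dataset, DataLoader


class ToySentenceDataset(Dataset):
    """Stand-in dataset: random variable-length index sequences with a float target."""

    def __init__(self, n=32, vocab_size=50):
        self.data = [
            (torch.randint(1, vocab_size, (torch.randint(3, 10, (1,)).item(),)),
             torch.rand(1).item())
            for _ in range(n)
        ]

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)


loader = DataLoader(ToySentenceDataset(), batch_size=8, shuffle=True,
                    collate_fn=pad_collate_fn)

embedding = nn.Embedding(50, 16, padding_idx=0)
gru = nn.GRU(16, 32, batch_first=True, bidirectional=True)

for features, labels, lengths in loader:
    # Pack after embedding so the GRU skips the padded positions
    packed = pack_padded_sequence(embedding(features), lengths, batch_first=True)
    _, h_n = gru(packed)                                 # h_n: (2, batch, 32)
    sentence_repr = torch.cat((h_n[0], h_n[1]), dim=1)   # (batch, 64)
    # ... feed sentence_repr through a linear layer and compute MSELoss against labels
    break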

Other possible problems and tips

  • Training loop: a great place for a lot of small errors; you may want to minimize those by using PyTorch Ignite. I am having an unbelievably hard time going through your Tensorflow-like-Estimator-like-API-like training loop (e.g. self.model = self.w2v_vocab = self.criterion = self.optimizer = self.scheduler = None). Please don't do it this way: separate each task (data creation, data loading, data preparation, model setup, training loop, logging) into its own respective module. All in all, there is a reason why PyTorch/Keras is more readable and sanity-preserving than Tensorflow.

  • Make the first row of your embedding a vector containing zeros: by convention (and with padding_idx=0 in torch.nn.Embedding), the first row is the one used for padding. Hence you should start your unique indexing for each word at 1, or set the padding_idx argument to a different value (though I highly discourage the latter approach; it is confusing at best). A short sketch of this idea follows below.
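A minimal sketch of that idea. The word vectors here are random stand-ins, and the zero @pad@ row is prepended as row 0 (unlike in the question, where it is appended at the end):

import torch
from torch import nn

emb_dim = 146                               # dimensionality from the question
word_vectors = torch.randn(1000, emb_dim)   # stand-in for the pretrained word2vec weights
weights = torch.cat((torch.zeros(1, emb_dim), word_vectors), dim=0)  # row 0 = all zeros

# padding_idx marks that row as padding, so it never receives gradient updates
embedding = nn.Embedding(weights.size(0), emb_dim, padding_idx=0)
with torch.no_grad():
    embedding.weight.copy_(weights)
embedding.weight.requires_grad = False       # frozen, as in the question

# Word indices must then start at 1, with 0 reserved for @pad@
batch = torch.tensor([[3, 17, 0, 0], [5, 2, 9, 1]])
print(embedding(batch)[0, 2])                # all zeros: a padded position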

I hope this answer helps you at least a little bit, if something is unclear post a comment below and I'll try to explain it from a different perspective/more detail.

Some final comments

This code is not reproducible, nor is the question specific: we don't have the data you are using, we don't have your word vectors, the random seed is not fixed, etc.

PS. One last thing: check your performance on a really small subset of your data (say, 96 examples); if it does not converge, it is very likely you indeed have a bug in your code.
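For example, a hypothetical way to set up such an overfitting sanity check (full_dataset is a placeholder for your own training Dataset, pad_collate_fn is the collate function above):

from torch.utils.data import Subset, DataLoader

# Take a tiny, fixed slice of the training data
tiny_set = Subset(full_dataset, list(range(96)))
tiny_loader = DataLoader(tiny_set, batch_size=32, shuffle=True,
                         collate_fn=pad_collate_fn)

# Train on tiny_loader for a few hundred epochs: if the model and training loop
# are correct, the training loss should drop close to zero (i.e. the model overfits).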

About the timings: they are probably off (due to the missing sorting and packing, I suppose); for correct and efficient implementations, Keras and PyTorch timings are usually quite similar (if I understood this part of your question as intended).

Permute vs view vs reshape explanation

This simple example shows the difference between permute() and view(). The first one swaps axes, while the second does not change the memory layout; it just chunks the array into the desired shape (if possible).

import torch

a = torch.tensor([[1, 2], [3, 4], [5, 6]])

print(a)
print(a.permute(1, 0))
print(a.view(2, 3))

And the output would be:

tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1, 3, 5],
        [2, 4, 6]])
tensor([[1, 2, 3],
        [4, 5, 6]])

reshape is almost like view; it was added for those coming from numpy, so it is easier and more natural for them, but it has one important difference:

  • view never copies data and works only on contiguous memory (so after a permutation like the one above your data may not be contiguous, hence access to it might be slower);
  • reshape can copy data if needed, so it also works for non-contiguous arrays. See the short demo below.
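To make the contiguity point concrete, a quick demo (continuing the tensor from the example above):

import torch

a = torch.tensor([[1, 2], [3, 4], [5, 6]])
b = a.permute(1, 0)            # same storage, different strides
print(b.is_contiguous())       # False

try:
    b.view(6)                  # view needs contiguous memory
except RuntimeError as err:
    print('view failed:', err)

print(b.reshape(6))            # copies when needed: tensor([1, 3, 5, 2, 4, 6])
print(b.contiguous().view(6))  # equivalent manual workaround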
Szymon Maszke
  • Thank you for the very elaborate answer! I am aware that the problem is not reproducible but I think that that is often the case with ML architecture issues because there is no easy way to share input data (providing external links is frowned upon at SO). I'm not sure how I could otherwise have asked this question. OT: if I understand correctly, GRU returns the output of the last layer for every timestep (*"tensor containing the output features h_t from the last layer of the GRU, for each t"*), and separately only the last hidden state, correct? – Bram Vanroy Feb 27 '19 at 11:49
  • I am asking because I would think to use the output rather than the last hidden state. (Not sure what the best approach would be here, though.) You are also suggesting padded batching, which I would like to implement indeed. So thank you for the explanation. But as it is currently written, using manual padding, I am not using it. Is there still any error in the way I use batches? I'll try out your suggestions in the coming days and I'll get back to you. Thanks again! – Bram Vanroy Feb 27 '19 at 11:53
  • About batches: Yes, it is definitely an error. You are filling your sentences with noise of varying length (vectors filled with zeros), which affects the execution. There is no way for PyTorch to tell __where__ the padding starts and the original sentence ends. `torch.nn.PackedSequence` is an object containing indices telling the network just that, so it could later be `unpadded` with `torch.nn.pad_packed_sequence`. Furthermore, using `PackedSequence` you are not performing unnecessary operations for your padded elements (the RNN does not calculate hidden states for those). – Szymon Maszke Feb 27 '19 at 12:04
  • About GRU: `output` contains hidden states from all timesteps __only for last layer__ (which does not matter in your case with shallow neural network), while `h_n` is the last hidden state (last timestep) __containing hidden from all layers__. For shallow network `output[-1]` is equal to `h_n`, shapes are just returned in a different fashion (due to all vs last [depth wise] layer being returned in the first and second case). Conceptually, `output` is hidden state but returned from each timestep, so essentially `hidden` == `output`. – Szymon Maszke Feb 27 '19 at 12:09
  • Oh, and really change `.view` to `.permute` in `forward` in your neural network model, it's a bug as well as both do different things (you may want to check on some examples or see the link I've provided). – Szymon Maszke Feb 27 '19 at 12:11
  • I just changed the `view()` with a `permute()` and now I am seeing results that I would actually expect (with still a fast training speed). Thanks! But I'd like to understand why. I read the documentation, but it's quite short about the difference between `view`, `reshape`, and `permute`. Can you give some more information why permute works and view doesn't? Because as far as I see it, the end shape is still the same. – Bram Vanroy Feb 27 '19 at 15:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/189136/discussion-between-szymon-maszke-and-bram-vanroy). – Szymon Maszke Feb 27 '19 at 16:03