
I'm running into a roadblock in my NLP learning. I'm working on a beginner's Kaggle competition classifying tweets as "disaster" or "not disaster". I started out by repurposing a simple network from a PyTorch tutorial, consisting of nn.EmbeddingBag and nn.Linear layers, and saw decent results during both training and inference:

self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
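
(For context, the full module is essentially the one from that tutorial; a rough sketch, with the class name mine and the forward signature following the tutorial's EmbeddingBag setup:)

import torch
from torch import nn

class BaselineTextModel(nn.Module):
    # rough sketch of the EmbeddingBag + Linear baseline; forward takes
    # flattened token indices plus offsets, as in the tutorial
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)  # one pooled embedding per tweet
        return self.fc(embedded)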

[screenshot of training/validation results]

The loss function is nn.BCEWithLogitsLoss, by the way.
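
That is, num_class is 1 here, so the model emits a single raw logit per tweet and the targets are floats of shape (batch, 1). A minimal sketch of what the criterion expects (the batch size and values below are made up):

import torch
from torch import nn

criterion = nn.BCEWithLogitsLoss()

# BCEWithLogitsLoss applies the sigmoid itself, so it takes raw logits
# and float targets of the same shape -- here (batch, 1) with batch = 4
logits = torch.randn(4, 1)
targets = torch.tensor([1.0, 0.0, 0.0, 1.0]).reshape(4, 1)
loss = criterion(logits, targets)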

I decided to up my game and throw an LSTM into the mix. I took a deep dive into padded/packed sequences and think I understand them pretty well. After reading around and thinking about it, I came to the conclusion that I should be grabbing the final non-padded hidden state of each sequence's output from the LSTM. That's what I tried below:

My attempt at upping my game:


import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)

    def forward(self, padded_seq, lengths):
        
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        # pack so the LSTM can skip the padded positions
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)

        # lstm layer
        output, _ = self.lstm(packed_output)
        padded_output, lengths = pad_packed_sequence(output, batch_first=True)

        # get hidden state of final non-padded sequence element:
        h_n = []
        for seq, length in zip(padded_output, lengths):
            h_n.append(seq[length - 1, :])
        
        lstm_out = torch.stack(h_n)
        
        # linear layer
        out = self.fc1(lstm_out)
        return out

This morning, I ported my notebook over to an IDE, ran the debugger, and confirmed that h_n is indeed the final non-padded hidden state of each sequence.
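
(As an aside, my understanding is that the same value can be read straight from the h_n tensor that nn.LSTM returns for a packed input, which would avoid the Python loop; a sketch of an equivalent forward, using the same layer names as above:)

def forward(self, padded_seq, lengths):
    embedded_padded = self.embedding(padded_seq)
    packed = pack_padded_sequence(embedded_padded, lengths, batch_first=True)

    # for a packed input, h_n holds each sequence's last non-padded hidden
    # state; its shape is (num_layers, batch, hidden_size), so take the top layer
    _, (h_n, _) = self.lstm(packed)
    lstm_out = h_n[-1]

    return self.fc1(lstm_out)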

So everything runs/trains without error, but my loss never decreases when I use a batch size > 1.

With batch_size = 8: [screenshot of the loss curve]

With batch_size = 1: [screenshot of the loss curve]

My Question

I would have expected this LSTM setup to perform much better on this simple task, so I'm wondering: where have I gone wrong?

Additional Information: Training Code

def train_one_epoch(model, opt, criterion, lr, trainloader):
    model.to(device)
    model.train()
    
    running_tl = 0
    
    for (label, data, lengths) in trainloader:
        
        opt.zero_grad()
        label = label.reshape(label.size()[0], 1)
        
        output = model(data, lengths)
        loss = criterion(output, label)

        running_tl += loss.item()
        loss.backward()
        opt.step()
        
    return running_tl
        
def validate_one_epoch(model, opt, criterion, lr, validloader):
    
    running_vl = 0
    
    model.eval()
    with torch.no_grad():
        for (label, data, lengths) in validloader:
            label = label.reshape(label.shape[0], 1)
            output = model(data, lengths)
            loss = criterion(output, label)
            running_vl += loss.item()
            
    return running_vl
    

def train_model(model, opt, criterion, epochs, trainload, testload=None, lr=1e-3):
    
    avg_tl_per_epoch = []
    avg_vl_per_epoch = []
    
    for e in trange(epochs):
        running_tl = train_one_epoch(model, opt, criterion, lr, trainload)
        avg_tl_per_epoch.append(running_tl / len(trainload))
        if testload:
            running_vl = validate_one_epoch(model, opt, criterion, lr, testload)
            avg_vl_per_epoch.append(running_vl / len(testload))
    
    return avg_tl_per_epoch, avg_vl_per_epoch
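
(For reference, this is roughly how everything is wired together; the dimensions, optimizer, and learning rate below are illustrative rather than my exact values, and the DataLoaders/collate_fn are built elsewhere:)

# illustrative wiring only -- vocab size, dimensions, optimizer, and learning
# rate are placeholder choices; trainloader/validloader come from the
# DataLoader setup (collate_fn not shown)
model = TextClassificationModel(vocab_size=20000, embed_dim=64,
                                hidden_size=64, num_class=1)
criterion = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

train_losses, valid_losses = train_model(model, opt, criterion, epochs=10,
                                         trainload=trainloader,
                                         testload=validloader)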
            
rocksNwaves

1 Answer


I think your model should look like this:

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)

    def forward(self, padded_seq, lengths):

        
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)

        # lstm layer
        output, _ = self.lstm(packed_output)
 
        out = self.fc1(output)
        return out

This is because, by default, the LSTM will just output the last hidden state as its output when provided with a sequence.

Also, depending on the number of examples, the simpler embedding + linear model might work better, as it needs less data to converge. Since your data consists of tweets (very short texts), the sequential aspect of the text might not be so important.

You have not provided the code for preprocessing your data. With text, good preprocessing is crucial, and I recommend taking a look at the PyTorch tutorial NLP From Scratch: Translation with a Sequence to Sequence Network and Attention.

  • The output of the LSTM is twofold: the hidden states of the final layer at all time steps, and the hidden state at the last time step of every layer. It definitely does not return only the last hidden state. Here is a reference for you: https://stackoverflow.com/questions/48302810/whats-the-difference-between-hidden-and-output-in-pytorch-lstm – rocksNwaves Jul 26 '21 at 17:37
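
(A quick shape check for a single-layer, unidirectional nn.LSTM with batch_first=True makes that distinction concrete; the sizes here are arbitrary:)

import torch
from torch import nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
x = torch.randn(8, 20, 16)            # (batch, seq_len, input_size)

output, (h_n, c_n) = lstm(x)
print(output.shape)   # torch.Size([8, 20, 32]): last layer's hidden state at every time step
print(h_n.shape)      # torch.Size([1, 8, 32]):  final time step's hidden state for each layer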