I'm running into a roadblock in my learning about NLP. I'm working on a beginner's Kaggle competition classifying tweets as "disaster" or "not disaster". I started out by repurposing a simple network from a PyTorch tutorial, composed of nn.EmbeddingBag and nn.Linear layers, and saw decent results during both training and inference:
self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
The loss function is nn.BCEWithLogitsLoss, by the way.
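For context, here is a minimal sketch of how those two layers and the loss fit together. The class name BaselineModel, the sizes, and the random batch are placeholders made up for illustration, and passing a fixed-length 2D batch to EmbeddingBag is an assumption; my real model follows the tutorial more closely.

```python
import torch
import torch.nn as nn


class BaselineModel(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, vocab_size, embed_dim, num_class):
        super().__init__()
        # EmbeddingBag pools the token embeddings of each row (default mode is "mean")
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        self.fc = nn.Linear(embed_dim, num_class)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> pooled: (batch, embed_dim) -> logits: (batch, num_class)
        return self.fc(self.embedding(token_ids))


torch.manual_seed(0)
model = BaselineModel(vocab_size=100, embed_dim=8, num_class=1)

# Made-up batch: 4 "tweets" of 6 token ids each
token_ids = torch.randint(0, 100, (4, 6))
# BCEWithLogitsLoss expects float targets with the same shape as the logits
labels = torch.tensor([[1.0], [0.0], [1.0], [0.0]])

criterion = nn.BCEWithLogitsLoss()
logits = model(token_ids)
loss = criterion(logits, labels)
```

The model outputs raw logits because BCEWithLogitsLoss applies the sigmoid internally.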
I decided to up my game and throw an LSTM into the mix. I took a deep dive into padded/packed sequences and think I understand them pretty well. After reading around and thinking about it, I concluded that I should grab the final non-padded hidden state of each sequence from the LSTM's output. That's what I tried below:
My attempt at upping my game:
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.fc1 = nn.Linear(hidden_size, num_class)

    def forward(self, padded_seq, lengths):
        # embedding layer
        embedded_padded = self.embedding(padded_seq)
        packed_output = pack_padded_sequence(embedded_padded, lengths, batch_first=True)

        # lstm layer
        output, _ = self.lstm(packed_output)
        padded_output, lengths = pad_packed_sequence(output, batch_first=True)

        # get hidden state of final non-padded sequence element:
        h_n = []
        for seq, length in zip(padded_output, lengths):
            h_n.append(seq[length - 1, :])
        lstm_out = torch.stack(h_n)

        # linear layers
        out = self.fc1(lstm_out)
        return out
This morning, I ported my notebook over to an IDE, ran the debugger, and confirmed that h_n is indeed the final hidden state of each sequence, not including padding.
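The same check can be reproduced outside the debugger. The sketch below uses made-up sizes and random inputs (not my actual data) and compares the last non-padded outputs, collected with the same loop as in forward, against the final hidden state that nn.LSTM returns in its second output. For a single-layer, unidirectional LSTM the two should match.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Made-up sizes, for illustration only
batch, max_len, embed_dim, hidden_size = 3, 5, 4, 6
# Sorted in descending order, which pack_padded_sequence expects by default
lengths = torch.tensor([5, 3, 2])
# Random stand-in for the output of the embedding layer
embedded = torch.randn(batch, max_len, embed_dim)

lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
packed = pack_padded_sequence(embedded, lengths, batch_first=True)
packed_out, (final_h, final_c) = lstm(packed)
padded_out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# Same loop as in forward(): last non-padded output of each sequence
last_outputs = torch.stack(
    [seq[length - 1, :] for seq, length in zip(padded_out, out_lengths)]
)

# final_h has shape (num_layers * num_directions, batch, hidden_size);
# with one unidirectional layer, final_h[-1] is (batch, hidden_size)
matches = torch.allclose(last_outputs, final_h[-1], atol=1e-6)
print(matches)
```

If the comparison holds, the loop in forward could presumably be replaced by indexing the hidden state the LSTM already returns, though that would not by itself explain the loss behaviour.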
Everything runs and trains without error, but my loss never decreases when I use a batch size greater than 1.
My Question
I would have expected this LSTM setup to perform much better on this simple task. So I'm wondering "Where have I gone wrong?"
Additional Information: Training Code
def train_one_epoch(model, opt, criterion, lr, trainloader):
    model.to(device)
    model.train()
    running_tl = 0
    for (label, data, lengths) in trainloader:
        opt.zero_grad()
        label = label.reshape(label.size()[0], 1)
        output = model(data, lengths)
        loss = criterion(output, label)
        running_tl += loss.item()
        loss.backward()
        opt.step()
    return running_tl

def validate_one_epoch(model, opt, criterion, lr, validloader):
    running_vl = 0
    model.eval()
    with torch.no_grad():
        for (label, data, lengths) in validloader:
            label = label.reshape(label.shape[0], 1)
            output = model(data, lengths)
            loss = criterion(output, label)
            running_vl += loss.item()
    return running_vl

def train_model(model, opt, criterion, epochs, trainload, testload=None, lr=1e-3):
    avg_tl_per_epoch = []
    avg_vl_per_epoch = []
    for e in trange(epochs):
        running_tl = train_one_epoch(model, opt, criterion, lr, trainload)
        avg_tl_per_epoch.append(running_tl / len(trainload))
        if testload:
            # use the loader passed in as testload rather than a global validloader
            running_vl = validate_one_epoch(model, opt, criterion, lr, testload)
            avg_vl_per_epoch.append(running_vl / len(testload))
    return avg_tl_per_epoch, avg_vl_per_epoch