I'm coming from Keras to PyTorch. I would like to create a PyTorch Embedding layer (a matrix of size V x D, where V is the vocabulary size indexed by word indices and D is the embedding vector dimension) initialized with GloVe vectors, but I am confused by the steps needed.
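To make the shape concrete, this is what I mean by the V x D matrix (the numbers here are just placeholders):

# Terminology check: an nn.Embedding's weight is a V x D matrix.
import torch.nn as nn

V, D = 10000, 100                  # placeholder vocabulary size and dimension
layer = nn.Embedding(V, D)
print(layer.weight.shape)          # torch.Size([10000, 100])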
In Keras, you can load the GloVe vectors by having the Embedding layer constructor take a weights
argument:
# Keras code.
embedding_layer = Embedding(..., weights=[embedding_matrix])
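For context, this is roughly how I build embedding_matrix on the Keras side. Treat it as a sketch: glove_path is a placeholder and word_index comes from the Keras Tokenizer, so the names are illustrative rather than exact code.

# Keras-side sketch: build embedding_matrix from a GloVe text file.
import numpy as np

embedding_dim = 100                         # e.g. glove.6B.100d
glove_path = 'glove.6B.100d.txt'            # placeholder path

# Parse the GloVe file into a {word: vector} dict.
embeddings_index = {}
with open(glove_path, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Row i holds the vector for the word with index i in word_index
# (rows for words not found in GloVe stay zero).
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector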
When looking at PyTorch and the TorchText library, I see that the embeddings should be loaded twice, once in a Field
and then again in an Embedding
layer. Here is sample code that I found:
# PyTorch code.
# Create a field for text and build a vocabulary with 'glove.6B.100d'
# pretrained embeddings.
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
TEXT.build_vocab(train_data, vectors='glove.6B.100d')
# Build an RNN model with an Embedding layer.
class RNN(nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        ...

# Initialize the embedding layer with the GloVe embeddings from the
# vocabulary. Why are two steps needed???
model = RNN(...)
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
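For comparison, this is the kind of one-step initialization I was expecting to find. It is only a sketch, assuming the vectors tensor built by build_vocab can be passed straight to nn.Embedding.from_pretrained:

# Sketch of what I expected: build the layer directly from the vocab's vectors.
# (Assumes TEXT.build_vocab(train_data, vectors='glove.6B.100d') has run.)
pretrained_embeddings = TEXT.vocab.vectors                        # [vocab_size, 100]
embedding = nn.Embedding.from_pretrained(pretrained_embeddings,   # copy GloVe weights
                                         freeze=False)            # allow fine-tuning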
Specifically:

- Why are the GloVe embeddings loaded in a Field in addition to the Embedding?
- I thought the Field function build_vocab() just builds its vocabulary from the training data. How are the GloVe embeddings involved here during this step? (See the sketch after this list for how I currently picture it.)
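To make my mental model explicit, here is how I currently picture the result of build_vocab when vectors are passed. The alignment via stoi/itos is my assumption, not something I have confirmed:

# My assumption: after build_vocab(..., vectors='glove.6B.100d'),
# TEXT.vocab.vectors is a [vocab_size, 100] tensor whose row i is the
# GloVe vector for the word TEXT.vocab.itos[i].
print(TEXT.vocab.vectors.shape)        # e.g. torch.Size([vocab_size, 100])
idx = TEXT.vocab.stoi['the']           # integer index of 'the' in the vocab
print(TEXT.vocab.vectors[idx][:5])     # first few components of its GloVe vector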
Here are other Stack Overflow questions that did not answer my questions:
PyTorch / Gensim - How to load pre-trained word embeddings
PyTorch LSTM - using word embeddings instead of nn.Embedding()
Thanks for any help.