I would like to use pre-trained embeddings in my neural network architecture. The pre-trained embeddings were trained with gensim. I found this informative answer which indicates that we can load pre-trained models like so:
import gensim
import torch
from torch import nn

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors)
emb = nn.Embedding.from_pretrained(weights)
This seems to work correctly, also on PyTorch 1.0.1. My question is that I don't quite understand what I have to feed into such a layer to utilise it. Can I just feed it the tokens of a segmented sentence? Do I need a mapping, for instance token-to-index?
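For clarity, this is the kind of token-to-index mapping I have in mind (just a sketch, assuming a gensim 3.x KeyedVectors object loaded as model above):
# Build a token -> index dictionary from the gensim vocabulary
token_to_idx = {token: voc.index for token, voc in model.vocab.items()}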
I found that you can access a token's vector simply by something like
print(model['the'])
# [-1.1206588e+00 1.1578362e+00 2.8765252e-01 -1.1759659e+00 ... ]
What does that mean for an RNN architecture? Can we simply feed in the tokens of the batch sequences? For instance:
for seq_batch, y in batch_loader():
    # seq_batch is a batch of sequences (tokenized sentences)
    # e.g. [['i', 'like', 'cookies'], ['it', 'is', 'raining'], ['who', 'are', 'you']]
    output, hidden = model(seq_batch, hidden)
This does not seem to work, so I am assuming you need to convert the tokens to their indices in the final word2vec model. Is that true? I found that you can get the indices of words by using the word2vec model's vocab:
model.vocab['world'].index
# 147
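To make the question concrete, this is the kind of conversion I imagine is needed before the embedding lookup (a sketch, assuming gensim 3.x, that all tokens exist in the vocabulary, and that the sequences in a batch have the same length):
batch = [['i', 'like', 'cookies'],
         ['it', 'is', 'raining'],
         ['who', 'are', 'you']]
# Replace each token by its index in the word2vec vocabulary
idx_batch = torch.tensor([[model.vocab[t].index for t in seq] for seq in batch],
                         dtype=torch.long)
# (batch_size, seq_len) -> (batch_size, seq_len, embedding_dim)
embedded = emb(idx_batch)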
So, as input to an Embedding layer, should I provide a tensor of int indices for a batch of sentences, where each sentence is a sequence of words? An example with a dummy dataloader (cf. the example above) and a dummy RNN would be welcome.
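For reference, this is roughly the skeleton I am trying to fill in: a dummy dataloader and a dummy RNN (only a sketch; DummyRNN, the GRU, and the hidden size are placeholders I made up, and it assumes model and weights from above plus gensim 3.x and that all example tokens exist in the vocabulary):
import torch
from torch import nn

class DummyRNN(nn.Module):
    def __init__(self, pretrained_weights, hidden_size=64):
        super().__init__()
        # Embedding layer initialised from the pre-trained word2vec vectors
        self.emb = nn.Embedding.from_pretrained(pretrained_weights)
        self.rnn = nn.GRU(pretrained_weights.size(1), hidden_size, batch_first=True)

    def forward(self, idx_batch, hidden=None):
        # idx_batch: (batch_size, seq_len) LongTensor of vocabulary indices
        embedded = self.emb(idx_batch)      # (batch, seq_len, emb_dim)
        return self.rnn(embedded, hidden)   # output, hidden

def dummy_batch_loader():
    # One dummy batch of equal-length tokenized sentences with dummy labels
    batch = [['i', 'like', 'cookies'],
             ['it', 'is', 'raining'],
             ['who', 'are', 'you']]
    idx = torch.tensor([[model.vocab[t].index for t in seq] for seq in batch],
                       dtype=torch.long)
    yield idx, torch.zeros(len(batch))

rnn = DummyRNN(weights)
for idx_batch, y in dummy_batch_loader():
    output, hidden = rnn(idx_batch)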