Does nn.Embedding make similar words closer to each other? And do I just need to give it all the sentences? Or is it just a lookup table, so I need to code the model myself?

4 Answers
nn.Embedding holds a Tensor of dimension (vocab_size, vector_size), i.e. the size of the vocabulary times the dimension of each vector embedding, plus a method that does the lookup.
When you create an embedding layer, the Tensor is initialised randomly. It is only when you train it that this similarity between similar words should appear, unless you have overwritten the values of the embedding with previously trained vectors, like GloVe or Word2Vec, but that's another story.
So, once you have the embedding layer defined, and the vocabulary defined and encoded (i.e. a unique number assigned to each word in the vocabulary), you can use the instance of the nn.Embedding class to get the corresponding embeddings.
For example:
import torch
from torch import nn

# vocabulary of 1,000 words, each mapped to a 128-dimensional vector
embedding = nn.Embedding(1000, 128)

# look up the vectors for the words with indices 3 and 4
embedding(torch.LongTensor([3, 4]))
will return the embedding vectors corresponding to words 3 and 4 in your vocabulary. As no model has been trained yet, they will be random.
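As a quick sanity check, a minimal sketch inspecting the underlying weight tensor of the same 1000 x 128 embedding as above:
import torch
from torch import nn

embedding = nn.Embedding(1000, 128)

# the lookup table itself: a (vocab_size, vector_size) trainable Parameter
print(embedding.weight.shape)          # torch.Size([1000, 128])
print(embedding.weight.requires_grad)  # True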

- For example, if I have a neural machine translation model and I don't use pretrained embeddings, will the embedding layer randomly initialize the word vectors and train those vectors along with the translation model? – MaybeNextTime Jul 16 '19 at 15:11
- Exactly, they will be initially random, and will be trainable parameters of the model (see the sketch after these comments). – Escachator Jul 16 '19 at 15:31
- @Escachator I got the principle you explained. I mainly wanted to know how the embeddings learned in the trained model get applied to the test data. For instance, I have ~20,000 words which I have converted to numbers and I've carried out their embedding. Now, in the test set, if I have a sequence of [3,2,5,4], do the model weights already contain a way to do factorization and carry out the embeddings? I'm just not sure of the maths in the background. – Kanishk Mair Feb 22 '20 at 15:34
- @KanishkMair the embeddings in the test set are fixed. There is no factorization or anything to be done. `embedding(torch.LongTensor([3,2,5,4]))` will just return the embedding of word (or token) 3, of word 2, etc., in a Tensor of shape `[4, dim_embedding]`. – Escachator Mar 22 '20 at 19:44
- @Escachator you said, "As no model has been trained, they will be random." I thought that during training similar words get closer, as said in the question. – Kanishk Mair Mar 24 '20 at 04:02
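A minimal sketch of that idea, using a toy stand-in model (the nn.Sequential with a Linear layer and the dummy loss are illustrative assumptions, not a real translation model): the embedding weights are ordinary trainable parameters, updated by backpropagation along with the rest of the model.
import torch
from torch import nn

# toy model: an embedding followed by a linear layer
model = nn.Sequential(
    nn.Embedding(1000, 128),
    nn.Linear(128, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

before = model[0].weight[3].detach().clone()

# one dummy training step on word indices 3 and 4
loss = model(torch.LongTensor([3, 4])).sum()
loss.backward()
optimizer.step()

# the embedding rows that were looked up have been updated by backpropagation
print(torch.equal(before, model[0].weight[3].detach()))  # False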
You could treat nn.Embedding as a lookup table where the key is the word index and the value is the corresponding word vector. However, before using it you should specify the size of the lookup table, and initialize the word vectors yourself. Following is a code example demonstrating this.
import torch
import torch.nn as nn

# vocab_size is the number of words in your train, val and test sets
# vector_size is the dimension of the word vectors you are using
embed = nn.Embedding(vocab_size, vector_size)

# initialize the word vectors; pretrained_weights is a
# numpy array of size (vocab_size, vector_size) where
# pretrained_weights[i] is the word vector of the
# i-th word in the vocabulary
embed.weight.data.copy_(torch.from_numpy(pretrained_weights))

# then turn word indices into actual word vectors
vocab = {"some": 0, "words": 1}
word_indexes = [vocab[w] for w in ["some", "words"]]
word_vectors = embed(torch.LongTensor(word_indexes))
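As an alternative, nn.Embedding.from_pretrained builds the layer directly from an existing weight matrix and can freeze it in one call; a minimal sketch, assuming the same pretrained_weights array and word_indexes as above:
# equivalent shortcut: build the layer from the pretrained matrix
# freeze=True (the default) makes the vectors non-trainable
embed = nn.Embedding.from_pretrained(torch.from_numpy(pretrained_weights).float(), freeze=True)
word_vectors = embed(torch.LongTensor(word_indexes))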

- So I still don't get the method by which these randomly initialized embeddings are learnt throughout the training process. Is that a simple CBOW or Skip-gram procedure or something else? – hexpheus Mar 04 '19 at 19:17
- The point is that nn.Embedding DOES NOT care whatever method you used to train the word embeddings; it is merely a "matrix" that stores the trained embeddings. When using nn.Embedding to load external word embeddings such as GloVe or fastText, it is the duty of these external word embeddings to determine the training method. – AveryLiu Mar 05 '19 at 08:29
- I get your point. However, when a weight matrix is not specified (randomly initialized), how is that fine-tuned during the training process? Are these weights bottleneck weights of an autoencoder maybe? Is there a simple reconstruction happening in the background during the fine-tuning? – hexpheus Mar 10 '19 at 07:52
- If you choose to fine-tune word vectors during training, these word vectors are treated as model parameters and are updated by backpropagation. – AveryLiu Jun 17 '19 at 06:48
- Yes, `nn.Embedding` is also a model parameter layer, which is by default trainable, and you can also make it untrainable by freezing its gradient. – AveryLiu Nov 10 '19 at 06:54
- @AveryLiu it would be a bad decision to use fastText or other n-gram techniques with embedding layers as is. Feature-wise it makes sense, but you are going to lose all the typo- and grammar-invariant goodies that fastText provides, because it computes unknown-word feature vectors from the n-grams, meaning that your `word_indexes` would be +Inf. Just process each word/sentence/x with fastText before plugging it into an input layer. – ntakouris Dec 08 '19 at 14:15
- @Zarkopafilis thanks for pointing out issues with n-gram word embeddings. Also, for people who come here by Google, it would be nice to look at pre-trained language models such as BERT, XLNet, etc. They usually get better results than using old-fashioned embeddings. – AveryLiu Dec 09 '19 at 07:00
- @AveryLiu you can use pre-trained fastText models too. Works better in chat and social-media scenarios where most typos / minor grammatical errors occur. – ntakouris Dec 09 '19 at 19:21
- @user614287 A great description of `padding_idx` can be found here: https://pytorch.org/docs/master/generated/torch.nn.Embedding.html. My understanding is that if you set, for instance, `padding_idx = 5`, then when the embedding layer sees an input with id = 5 it returns a zero vector and backpropagation does not update the corresponding parameters (see the sketch after these comments). – Bright Chang Jul 12 '20 at 12:53
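A minimal sketch of the padding_idx behaviour described in the last comment:
import torch
from torch import nn

# the row at padding_idx is initialized to zeros and never receives gradient updates
emb = nn.Embedding(10, 4, padding_idx=5)
print(emb(torch.LongTensor([5, 1])))  # first row is all zeros, second is a random vector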
torch.nn.Embedding just creates a lookup table, to get the word embedding given a word index.
from collections import Counter
import torch
import torch.nn as nn

# Let's say you have 2 sentences (lowercased, punctuation removed):
sentences = "i am new to pytorch i am having fun"
words = sentences.split(' ')

vocab = Counter(words)  # count word frequencies
vocab = sorted(vocab, key=vocab.get, reverse=True)  # list of words, most frequent first
vocab_size = len(vocab)

# map words to unique indices
word2idx = {word: ind for ind, word in enumerate(vocab)}
# word2idx = {'i': 0, 'am': 1, 'new': 2, 'to': 3, 'pytorch': 4, 'having': 5, 'fun': 6}

encoded_sentences = [word2idx[word] for word in words]
# encoded_sentences = [0, 1, 2, 3, 4, 0, 1, 5, 6]

# let's say you want the embedding dimension to be 3
emb_dim = 3
Now, the embedding layer can be initialized as:
emb_layer = nn.Embedding(vocab_size, emb_dim)
word_vectors = emb_layer(torch.LongTensor(encoded_sentences))
This initializes the embeddings from a standard normal distribution (that is, zero mean and unit variance). Thus, these word vectors don't have any sense of 'relatedness'.
word_vectors is a torch tensor of size (9, 3), since there are 9 words in our data.
emb_layer has one trainable parameter called weight, which is, by default, set to be trained. You can check it by:
emb_layer.weight.requires_grad
which returns True. If you don't want to train your embeddings during model training (say, when you are using pre-trained embeddings), you can set it to False by:
emb_layer.weight.requires_grad = False
If your vocabulary size is 10,000 and you wish to initialize the embeddings with pre-trained embeddings (of dim 300), say, Word2Vec, do it as:
emb_layer = nn.Embedding(10000, 300)
emb_layer.load_state_dict({'weight': torch.from_numpy(emb_mat)})
here, emb_mat is a NumPy matrix of size (10000, 300) containing 300-dimensional Word2Vec word vectors for each of the 10,000 words in your vocabulary.
Now, the embedding layer is loaded with Word2Vec word representations.
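A quick sanity check, assuming the same hypothetical emb_mat as above, plus freezing the vectors with the requires_grad trick shown earlier so they are not fine-tuned:
# verify the Word2Vec vectors were copied in (emb_mat is the NumPy matrix described above)
print(torch.allclose(emb_layer.weight.data, torch.from_numpy(emb_mat).float()))  # True

# optionally freeze them so they are not fine-tuned during training
emb_layer.weight.requires_grad = False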

Agh! I think this part is still missing: showing that when you create the embedding layer you automatically get the weights, which you may later replace with
nn.Embedding.from_pretrained(weight)
import torch
import torch.nn as nn
embedding = nn.Embedding(10, 4)
print(type(embedding))
print(embedding)
t1 = embedding(torch.LongTensor([0,1,2,3,4,5,6,7,8,9])) # adding index 10 won't work: the vocabulary size is 10, so valid indices are 0-9
print(t1.shape)
print(t1)
t2 = embedding(torch.LongTensor([1,2,3]))
print(t2.shape)
print(t2)
# predefined weights (vocab size 2, embedding dim 3)
weight = torch.FloatTensor([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(weight.shape)
embedding = nn.Embedding.from_pretrained(weight)
# get embeddings for ind 0 and 1
embedding(torch.LongTensor([0, 1]))
Output:
<class 'torch.nn.modules.sparse.Embedding'>
Embedding(10, 4)
torch.Size([10, 4])
tensor([[-0.7007, 0.0169, -0.9943, -0.6584],
[-0.7390, -0.6449, 0.1481, -1.4454],
[-0.1407, -0.1081, 0.6704, -0.9218],
[-0.2738, -0.2832, 0.7743, 0.5836],
[ 0.4950, -1.4879, 0.4768, 0.4148],
[ 0.0826, -0.7024, 1.2711, 0.7964],
[-2.0595, 2.1670, -0.1599, 2.1746],
[-2.5193, 0.6946, -0.0624, -0.1500],
[ 0.5307, -0.7593, -1.7844, 0.1132],
[-0.0371, -0.5854, -1.0221, 2.3451]], grad_fn=<EmbeddingBackward>)
torch.Size([3, 4])
tensor([[-0.7390, -0.6449, 0.1481, -1.4454],
[-0.1407, -0.1081, 0.6704, -0.9218],
[-0.2738, -0.2832, 0.7743, 0.5836]], grad_fn=<EmbeddingBackward>)
torch.Size([2, 3])
tensor([[0.1000, 0.2000, 0.3000],
[0.4000, 0.5000, 0.6000]])
And the last part is that the Embedding layer weights can be learned with gradient descent.
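A minimal sketch of that, with a dummy loss: only the rows that were actually looked up receive a non-zero gradient, and the optimizer step updates the weight matrix.
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)
optimizer = torch.optim.SGD(embedding.parameters(), lr=0.1)

out = embedding(torch.LongTensor([1, 2, 3]))
loss = out.pow(2).sum()  # dummy loss, just for illustration
loss.backward()

# only the rows that were looked up get a non-zero gradient
print(embedding.weight.grad[1])  # non-zero
print(embedding.weight.grad[0])  # all zeros
optimizer.step()                 # gradient descent update of the embedding weights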
