
While transfer learning / fine-tuning recent language models such as BERT and XLNet is by now a very common practice, how does this apply to GloVe?

Basically, I see two options when using GloVe to get dense vector representations that can be used by downstream NNs.

1) Fine-tune the GloVe embeddings (in PyTorch terms, with gradients enabled)

2) Just use the embeddings as they are, without gradients.

For instance, given GloVe's embedding matrix, I do

import torch
import torch.nn as nn

embed = nn.Embedding.from_pretrained(torch.tensor(embedding_matrix, dtype=torch.float))
...
dense = nn.Linear(...)
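
The two options then differ only in whether gradients flow into the embedding weights, which from_pretrained exposes through its freeze argument. A minimal sketch of what I mean (assuming embedding_matrix is the GloVe matrix loaded as a NumPy array, as above):

# Option 2 (default): keep the GloVe vectors fixed, no gradients.
embed_frozen = nn.Embedding.from_pretrained(
    torch.tensor(embedding_matrix, dtype=torch.float), freeze=True)

# Option 1: fine-tune the GloVe vectors together with the rest of the model.
embed_tuned = nn.Embedding.from_pretrained(
    torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)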

Is it best practice to solely use GloVe to get vector representations (and only train the dense layer and potentially other layers), or would one fine-tune the embedding matrix, too?

pedjjj

2 Answers


You should absolutely fine-tune your word embedding matrix. Here is the thing: when you initialize the word embedding matrix with the GloVe word embeddings, your word embeddings will already capture most of the semantic properties of the data. However, you want your word embeddings to be tailored to the task you're solving, i.e., task-specific (see Yang). Now, if you don't have enough data in your dataset, you can't learn the word embedding matrix on your own (that is, starting from randomly initialized vectors). Because of that, you want to initialize it with vectors that have been trained on huge datasets and are general-purpose.

One really important thing to keep in mind: because the rest of your model is going to be initialized randomly, your word embedding matrix may suffer from catastrophic forgetting when you start training (see the work of Howard and Ruder, and of Kirkpatrick et al.). That is, the gradients will be huge because your model will drastically underfit the data for the first few batches, and you will lose the initial vectors completely. You can overcome this by:

  1. For the first several epochs, don't fine-tune the word embedding matrix; just keep it as it is: embeddings = nn.Embedding.from_pretrained(glove_vectors, freeze=True).

  2. After the rest of the model has learned to fit your training data, decrease the learning rate, unfreeze your embedding module (embeddings.weight.requires_grad = True), and continue training (a sketch of the full schedule follows below).
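
A minimal PyTorch sketch of this freeze-then-unfreeze schedule (MyModel, train_one_epoch, warmup_epochs, finetune_epochs, and train_loader are placeholder names, not part of the original recipe):

import torch
import torch.nn as nn

# glove_vectors: FloatTensor of shape (vocab_size, embedding_dim)
embeddings = nn.Embedding.from_pretrained(glove_vectors, freeze=True)
model = MyModel(embeddings)  # hypothetical downstream model that registers `embeddings`

# Phase 1: train everything except the (frozen) embedding matrix.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
for epoch in range(warmup_epochs):
    train_one_epoch(model, optimizer, train_loader)  # hypothetical training helper

# Phase 2: unfreeze the embeddings and continue with a lower learning rate.
embeddings.weight.requires_grad = True
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
for epoch in range(finetune_epochs):
    train_one_epoch(model, optimizer, train_loader)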

By following the above-mentioned steps, you will get the best of both worlds. In other words, your word embeddings will still capture semantic properties while being tailored to your own downstream task. Finally, there are works (see Ye Zhang, for example) showing that it is fine to fine-tune immediately, but I would opt for the safer option.

gorjan
  • If the optimum number of epochs is relatively small, e.g., 3-5, it would probably make sense to immediately fine-tune the embeddings as well, right? – pedjjj Nov 06 '19 at 17:58
  • Awesome question! No, not at all; you could train the model with a frozen embedding matrix until convergence (actually, that is the recommended way to do it), and then unfreeze the embedding matrix and let the model train for a couple of epochs. – gorjan Nov 06 '19 at 18:12
  • Cool, thank you! Do you happen to have a citable reference for such a procedure? – pedjjj Nov 06 '19 at 18:14
  • The paper from [Howard and Ruder](https://arxiv.org/abs/1801.06146) (also included in the answer) is a really good source for transfer learning in NLP. Although it goes through the case of language-model fine-tuning, the suggested methods are applicable elsewhere. – gorjan Nov 06 '19 at 18:18
  • Thank you, that's pretty useful! :) – pedjjj Nov 07 '19 at 11:20

There is no reason not to fine-tune the GloVe embeddings in order to get a better score on your final task, unless you have to keep a link with another model which uses the original embeddings (for interpreting your results, for instance).

When fine-tuning the embeddings for your objective function, the word embeddings will (potentially) lose their initial properties (performing well on word similarity and analogy tasks).

Using pretrained word embeddings is just a way to avoid initializing with random vectors, so would it make sense to keep random vectors fixed?

There are several papers which fine-tune the word embeddings, for instance this one: https://arxiv.org/abs/1505.07931

I made the assumption that you have enough training data. Otherwise, it would be better to keep the word embeddings fixed, since that involves fewer parameters to train and thus helps avoid overfitting.
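
To make the "fewer parameters to train" point concrete, here is a rough sketch of the trainable-parameter count with and without frozen embeddings (the vocabulary size, embedding dimension, and hidden size below are made-up numbers):

import torch.nn as nn

vocab_size, emb_dim, hidden = 20_000, 300, 128  # made-up sizes

embed = nn.Embedding(vocab_size, emb_dim)  # imagine it initialized from GloVe
dense = nn.Linear(emb_dim, hidden)

def trainable(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(trainable(embed) + trainable(dense))  # 6,038,528: the embedding matrix dominates

embed.weight.requires_grad = False  # keep the GloVe vectors fixed
print(trainable(embed) + trainable(dense))  # 38,528: only the dense layer is trained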

Stanislas Morbieu