7

I am using GloVe as part of my research. I've downloaded the models from here. I've been using GloVe for sentence classification. The sentences I'm classifying are specific to a particular domain, say some STEM subject. However, since the existing GloVe models are trained on a general corpus, they may not yield the best results for my particular task.

So my question is: how would I go about loading the pretrained model and retraining it a little more on my own corpus, so that it also learns the semantics of my corpus? There would be merit in doing this if it were possible.

cs95

3 Answers

2

After a little digging, I found this issue on the git repo. Someone suggested the following:

Yeah, this is not going to work well due to the optimization setup. But what you can do is train GloVe vectors on your own corpus and then concatenate those with the pretrained GloVe vectors for use in your end application.

So that answers that.
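For illustration, here is a minimal sketch of what that concatenation could look like, assuming both embedding sets have already been loaded into plain word-to-vector dictionaries (all names and toy vectors below are hypothetical):

import numpy as np

def concat_embeddings(pretrained, domain):
    """Concatenate pretrained and domain-trained vectors per word,
    padding with zeros when a word is missing from one vocabulary."""
    pre_dim = len(next(iter(pretrained.values())))
    dom_dim = len(next(iter(domain.values())))
    combined = {}
    for word in set(pretrained) | set(domain):
        pre_vec = pretrained.get(word, np.zeros(pre_dim))
        dom_vec = domain.get(word, np.zeros(dom_dim))
        combined[word] = np.concatenate([pre_vec, dom_vec])
    return combined

# Toy usage: 3-d pretrained vectors plus 2-d domain vectors give 5-d vectors.
pretrained = {"circuit": np.array([0.1, 0.2, 0.3])}
domain = {"circuit": np.array([1.0, 2.0]), "thermistor": np.array([0.5, 0.5])}
combined = concat_embeddings(pretrained, domain)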

cs95
1

I believe GloVe (Global Vectors) is not meant to be appended to, since it is based on the overall word co-occurrence statistics of a single corpus known only at initial training time.

What you can do is use the gensim.scripts.glove2word2vec API to convert the GloVe vectors into word2vec format, but I don't think you can continue training, since they are loaded as a KeyedVectors object rather than a full model.
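For reference, a minimal sketch of that conversion (file names are the standard Stanford downloads; in gensim 4.x the conversion script is deprecated and the GloVe file can be loaded directly with no_header=True):

from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the plain GloVe text file into word2vec text format
# (adds the "<vocab_size> <dim>" header line that word2vec expects).
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")

kv = KeyedVectors.load_word2vec_format("glove.6B.100d.w2v.txt", binary=False)
# In gensim 4.x you can skip the conversion step:
# kv = KeyedVectors.load_word2vec_format("glove.6B.100d.txt", binary=False, no_header=True)

print(kv.most_similar("physics", topn=5))
# kv is a KeyedVectors lookup table: you can query vectors and similarities,
# but it carries no training state, so it cannot be trained further.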

StevenWernerCS
1

The Mittens library (installable via pip) does that, provided your corpus/vocabulary is not too huge or your RAM is big enough to handle the entire co-occurrence matrix.

Three steps:

import csv
import pickle

import numpy as np
from collections import Counter
from nltk.corpus import brown
from mittens import GloVe, Mittens
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

1- Load pretrained model - Mittens needs a pretrained model to be loaded as a dictionary. Get the pretrained model from https://nlp.stanford.edu/projects/glove

with open("glove.6B.100d.txt", encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=' ', quoting=csv.QUOTE_NONE)
    # word -> 100-d vector dictionary of the pretrained embeddings
    pre_glove = {line[0]: np.array(list(map(float, line[1:])))
                 for line in reader}

Data pre-processing

sw = list(ENGLISH_STOP_WORDS)
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
new_vocab = [token for token in brown_nonstop if token not in pre_glove.keys()]

The Brown corpus is used as a sample dataset here, and new_vocab represents the vocabulary not present in the pretrained GloVe model. The co-occurrence matrix is built from this vocabulary; stored as a dense array it requires O(n^2) space, so you can optionally filter out rare new_vocab words to save space:

new_vocab_rare = [k for (k,v) in Counter(new_vocab).items() if v<=1]
corp_vocab = list(set(new_vocab) - set(new_vocab_rare))

Remove those rare words and prepare the dataset:

brown_tokens = [token for token in brown_nonstop if token not in new_vocab_rare]
brown_doc = [' '.join(brown_tokens)]

2- Building the co-occurrence matrix: sklearn's CountVectorizer transforms the document into a word-document count matrix. The matrix multiplication X^T * X then gives the word-word co-occurrence matrix.

cv = CountVectorizer(ngram_range=(1, 1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)
Xc = (X.T * X)           # word-word co-occurrence counts
Xc.setdiag(0)            # zero out each word's co-occurrence with itself
coocc_ar = Xc.toarray()

3- Fine-tuning the Mittens model - Instantiate the model and run the fit function. Note that n must match the dimensionality of the pretrained vectors (100 for glove.6B.100d.txt).

mittens_model = Mittens(n=100, max_iter=1000)
new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)

Save the model as pickle for future use.

newglove = dict(zip(corp_vocab, new_embeddings))
with open("repo_glove.pkl", "wb") as f:
    pickle.dump(newglove, f)
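Later, the saved embeddings can be loaded back as a plain word-to-vector dictionary, for example:

import pickle

with open("repo_glove.pkl", "rb") as f:
    finetuned_glove = pickle.load(f)

# Each entry maps a word to its fine-tuned numpy vector.
word = next(iter(finetuned_glove))
print(word, finetuned_glove[word].shape)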
Abhi25t