The Mittens library (installable via pip) does exactly this, as long as your vocabulary is small enough that the full co-occurrence matrix fits in RAM.
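As a rough feasibility check before committing to this route (my own back-of-the-envelope estimate, assuming a dense float64 co-occurrence array at 8 bytes per entry):

vocab_size = 50_000                        # hypothetical vocabulary size
bytes_needed = 8 * vocab_size ** 2         # dense float64 co-occurrence array
print(f"~{bytes_needed / 1e9:.0f} GB")     # ~20 GB for 50k words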
There are 3 steps -

import csv
import pickle
import numpy as np
from collections import Counter
from nltk.corpus import brown
from mittens import GloVe, Mittens
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import CountVectorizer
1- Load the pretrained model: Mittens needs the pretrained embeddings loaded as a dictionary mapping each word to its numpy vector. Get a pretrained model from https://nlp.stanford.edu/projects/glove
with open("glove.6B.100d.txt", encoding='utf-8') as f:
reader = csv.reader(f, delimiter=' ',quoting=csv.QUOTE_NONE)
embed = {line[0]: np.array(list(map(float, line[1:])))
for line in reader}
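A quick check that the load worked (any common word will do; the vectors are 100-dimensional because the 100d file was used):

print(pre_glove["the"].shape)   # (100,)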
Data pre-processing
sw = list(ENGLISH_STOP_WORDS)
brown_data = brown.words()[:200000]
brown_nonstop = [token.lower() for token in brown_data if token.lower() not in sw]
new_vocab = [token for token in brown_nonstop if token not in pre_glove.keys()]
The Brown corpus is used as a sample dataset here, and new_vocab holds the tokens not present in the pretrained GloVe vocabulary. The co-occurrence matrix is built from new_vocab; in dense form it needs O(n^2) space in the vocabulary size n, so you can optionally filter out rare new_vocab words to save space:
new_vocab_rare = [k for (k,v) in Counter(new_vocab).items() if v<=1]
corp_vocab = list(set(new_vocab) - set(new_vocab_rare))
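To see how aggressive the filter is (a small diagnostic of my own, not required):

print(len(set(new_vocab)), len(new_vocab_rare), len(corp_vocab))  # distinct OOV words, rare ones dropped, kept vocabulary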
Remove those rare words and prepare the document:
brown_tokens = [token for token in brown_nonstop if token not in new_vocab_rare]
brown_doc = [' '.join(brown_tokens)]
2- Building the co-occurrence matrix: sklearn's CountVectorizer transforms the document into a word-document count matrix X. The matrix multiplication X.T * X then gives the word-word co-occurrence matrix.
cv = CountVectorizer(ngram_range=(1, 1), vocabulary=corp_vocab)
X = cv.fit_transform(brown_doc)   # word-document count matrix
Xc = X.T * X                      # word-word co-occurrence matrix (sparse)
Xc.setdiag(0)                     # zero out self co-occurrence
coocc_ar = Xc.toarray()           # dense array passed to Mittens below
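As a quick sanity check (my addition, not part of the original recipe), the array should be square with one row and column per word in corp_vocab:

assert coocc_ar.shape == (len(corp_vocab), len(corp_vocab))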
3- Fine-tuning the Mittens model: instantiate the model and run the fit function. The embedding dimension n must match the pretrained vectors (100 here, since glove.6B.100d.txt was loaded).
mittens_model = Mittens(n=100, max_iter=1000)

new_embeddings = mittens_model.fit(
    coocc_ar,
    vocab=corp_vocab,
    initial_embedding_dict=pre_glove)
fit returns a numpy array whose rows follow the order of corp_vocab, so zip pairs each word with its new vector. Save the result as a pickle for future use.
newglove = dict(zip(corp_vocab, new_embeddings))
with open("repo_glove.pkl", "wb") as f:
    pickle.dump(newglove, f)
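Later you can load it back and, if you want a single lookup table, merge it with the original pretrained dictionary so in-vocabulary words keep their GloVe vectors (the merge is just a suggestion, not something Mittens requires):

with open("repo_glove.pkl", "rb") as f:
    newglove = pickle.load(f)

# combined lookup: pretrained vectors plus the fine-tuned OOV vectors
combined_glove = {**pre_glove, **newglove}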