
I have used Keras to load pre-trained word embeddings, but I am not quite sure how to do the same with a scikit-learn model.

I need to do this in sklearn as well because I am using vecstack to ensemble a Keras sequential model and an sklearn model.

This is what I have done for the Keras model:

import os
import numpy as np

# Build a lookup from each word to its pre-trained GloVe vector.
glove_dir = '/home/Documents/Glove'
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.200d.txt'), 'r', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

embedding_dim = 200


# Map each index in the tokenizer's word_index to its GloVe vector
# (rows for out-of-vocabulary words stay all-zero).
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
.
.
# Load the pre-trained GloVe weights into the Embedding layer and freeze it.
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
model.compile(...)
model.fit(...)

I am very new to scikit-learn. From what I have seen, to make a model in sklearn you do:

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

So, my question is: how do I use pre-trained GloVe with this model? Where do I pass the pre-trained GloVe embedding_matrix?

Thank you very much and I really appreciate your help.

BlueMango
  • Please describe what model you want to build in `sklearn`, best with a formula and/or descriptive diagram. – dedObed Mar 16 '19 at 16:20
  • Hello, I just want a logistic regression model that uses pre-trained word embeddings and takes the average of the word embedding vectors (see the sketch after these comments). – BlueMango Mar 16 '19 at 16:30
  • The input is an Amazon review. Since it's a review (text), word embeddings play a huge role, right? – BlueMango Mar 16 '19 at 16:44
  • So you want to input... a bag-of-words representation of some text, i.e. a fixed-length vector of counts of individual words in the text? – dedObed Mar 16 '19 at 17:03
  • Well, yes and no. I have used Tokenizer to vectorize and convert the text into sequences so it can be used as input. Instead of bag-of-words I want word embeddings, because I think the bag-of-words approach is very domain-specific and I also want to work cross-domain. – BlueMango Mar 16 '19 at 17:13
  • @BlueMango I am trying to work on a similar problem now. I think what you want to do is: once you have your vectorized documents in a sparse matrix, you can add some additional columns that hold the average word embedding (i.e. a real-valued vector) of all the words in the document. That should add a number of features that bring context into the classifier from outside your corpus. – Mike Jun 07 '19 at 16:14
  • @BlueMango, have you solved this problem? I also need to use GloVe embeddings with an sklearn machine learning model. Please do update. – Aizayousaf Feb 08 '23 at 23:53
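
A minimal sketch of the averaging approach discussed in these comments, assuming the embeddings_index and embedding_dim built in the question, and that corpus_train / y_train (names used here for illustration) hold the raw training texts and labels:

import numpy as np
from sklearn.linear_model import LogisticRegression

def average_embedding(text, embeddings_index, embedding_dim):
    # Mean of the GloVe vectors of the words found in the text;
    # a zero vector if no word is in the vocabulary.
    vectors = [embeddings_index[w] for w in text.lower().split()
               if w in embeddings_index]
    if not vectors:
        return np.zeros(embedding_dim, dtype='float32')
    return np.mean(vectors, axis=0)

# Each document becomes one fixed-length feature vector.
X_train = np.array([average_embedding(doc, embeddings_index, embedding_dim)
                    for doc in corpus_train])
lr = LogisticRegression()
lr.fit(X_train, y_train)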

1 Answer


You can simply use the Zeugma library.

You can install it with pip install zeugma, then create and train your model with the following lines of code (assuming corpus_train and corpus_test are lists of strings, and y_train holds the training labels):

from sklearn.linear_model import LogisticRegression
from zeugma.embeddings import EmbeddingTransformer

glove = EmbeddingTransformer('glove')
x_train = glove.transform(corpus_train)

model = LogisticRegression()
model.fit(x_train, y_train)

x_test = glove.transform(corpus_test)
model.predict(x_test)

You can also use different pre-trained embeddings (complete list here) or train your own (see Zeugma's documentation for how to do this).
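
Since EmbeddingTransformer is designed as a scikit-learn-compatible transformer, it can also be chained with the classifier in a Pipeline. A short sketch, assuming the same corpus_train, y_train, and corpus_test as above:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from zeugma.embeddings import EmbeddingTransformer

# Embed each text with GloVe and classify in a single estimator.
pipeline = Pipeline([
    ('embedding', EmbeddingTransformer('glove')),
    ('classifier', LogisticRegression()),
])
pipeline.fit(corpus_train, y_train)
predictions = pipeline.predict(corpus_test)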

Wajsbrot