
I am using the pre-trained Google News dataset to get word vectors, via the Gensim library in Python:

from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
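(For reference, in newer gensim releases the same vectors are loaded through KeyedVectors rather than Word2Vec; a minimal sketch of the equivalent call:)

from gensim.models import KeyedVectors

# gensim >= 1.0 moved the word2vec-format loader to KeyedVectors
word_vectors = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)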

After loading the model, I convert the words of my training review sentences into vectors:

import numpy as np

# read all sentences from the training file
with open('restaurantSentences', 'r') as infile:
    x_train = infile.readlines()

# clean the sentences (review_to_wordlist is a user-defined helper)
x_train = [review_to_wordlist(review, remove_stopwords=True) for review in x_train]

# build one fixed-size vector per sentence (buildWordVector is user-defined)
train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])

During the word2vec lookup I get a lot of errors for words in my corpus that are not in the model's vocabulary. The problem is: how can I retrain the already pre-trained model (e.g. GoogleNews-vectors-negative300.bin) in order to get word vectors for those missing words?
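(For context, buildWordVector is not shown above; a typical implementation averages the vectors of a sentence's words, and without an in-vocabulary check it raises KeyError for missing words, which is where the errors come from. A minimal sketch, assuming model is the loaded model and n_dim is the vector size, 300 here:)

import numpy as np

def buildWordVector(words, n_dim):
    # average the vectors of in-vocabulary words into one sentence vector
    vec = np.zeros(n_dim).reshape((1, n_dim))
    count = 0
    for word in words:
        if word in model:  # skip words missing from the pre-trained model
            vec += model[word].reshape((1, n_dim))
            count += 1
    if count != 0:
        vec /= count
    return vec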

Here is what I have tried so far. First, I trained a new model from the training sentences I had:

# Set values for various parameters
num_features = 300    # word vector dimensionality
min_word_count = 10   # minimum word count
num_workers = 4       # number of threads to run in parallel
context = 10          # context window size
downsampling = 1e-3   # downsample setting for frequent words

sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
# Initialize and train the model (this will take some time)
print "Training model..."
model = gensim.models.Word2Vec(sentences, workers=num_workers, size=num_features,
                               min_count=min_word_count, window=context,
                               sample=downsampling)


# Note: passing `sentences` to the constructor above already builds the
# vocabulary and trains the model, so these two calls are redundant here.
model.build_vocab(sentences)
model.train(sentences)
model.n_similarity(["food"], ["rice"])

It worked! But the problem is that I have a really small dataset and too few resources to train a large model.

The second way I am looking at is to extend the already trained model, such as GoogleNews-vectors-negative300.bin:

model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
sentences = gensim.models.word2vec.LineSentence("restaurantSentences")
model.train(sentences)

Is this possible, and is it a good approach? Please help me out.

asked by Nomiluks · edited by Lior Magen
    Possible duplicate of [Update gensim word2vec model](http://stackoverflow.com/questions/22121028/update-gensim-word2vec-model) – Kamil Sindi Dec 02 '16 at 16:29

3 Answers


This is how I technically solved the issue:

Preparing data input with sentence iterable from Radim Rehurek: https://rare-technologies.com/word2vec-tutorial/

sentences = MySentences('newcorpus')  
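(For completeness, the MySentences iterable from that tutorial looks roughly like this; it streams one tokenized sentence per line from every file in a directory:)

import os

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # stream one whitespace-tokenized sentence per line, file by file
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()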

Setting up the model

model = gensim.models.Word2Vec(sentences)

Intersecting the vocabulary with the google word vectors

model.intersect_word2vec_format('GoogleNews-vectors-negative300.bin',
                                lockf=1.0,
                                binary=True)

Finally, training the model to update the weights

model.train(sentences)
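(Note that in gensim 1.x and later, train() additionally requires explicit counts; under that assumption the call becomes:)

model.train(sentences,
            total_examples=model.corpus_count,
            epochs=model.iter)  # use model.epochs in gensim 4.x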

A note of warning: from a substantive point of view, it is of course highly debatable whether a corpus that is likely to be very small can actually "improve" the Google word vectors, which were trained on a massive corpus...

Chris Arnold
    Your comment suggests that this method is designed to "improve" Google's word vectors. [The documentation](https://radimrehurek.com/gensim/models/word2vec.html) would suggest that it actually uses Google's vectors to improve your model, not the other way around. *(No words are added to the existing vocabulary, but intersecting words adopt the file’s weights, and non-intersecting words are left alone.)* I tried your method and checked my model's corpus size. It reflected the new training data, not Google News. – lgallen May 10 '17 at 02:33
  • You are right - maybe the term _to improve_ is misleading here. What the code does is update the words from the new corpus and return those. – Chris Arnold May 11 '17 at 09:59
  • The vocabulary of Google's word vectors is around 3,000,000 words, so if your corpus is much smaller, say around 10,000 words, then on intersecting, the size of your model stays 10,000; but the words in your model are directly assigned the weights from Google's word vectors, completely ignoring the previous weights from your original model. So it won't make any difference unless you too have a very large corpus to train on. – Pranzell Mar 18 '19 at 07:25
  • I tried this: `model.intersect_word2vec_format('tweets_cbow_300', lockf=1.0, binary=False, encoding='utf8')`. It returns an error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte`. Adding `unicode_errors='ignore'` doesn't solve the issue. – user3057109 Mar 08 '20 at 12:58

Some folks have been working on extending gensim to allow online training.

There are a couple of GitHub pull requests you might want to watch for progress on that effort.

It looks like this improvement could allow updating the GoogleNews-vectors-negative300.bin model.

orluke

It is possible if the model builder didn't finalize the model training. In Python it is:

model.init_sims(replace=True)  # finalize the model

If the model hasn't been finalized, this is a perfect way to keep training it on a large dataset.
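(A minimal sketch of the workflow this implies, where more_sentences stands for any further iterable of tokenized sentences:)

# while the model is not finalized, training can continue incrementally
model.train(more_sentences)

# init_sims(replace=True) discards the raw weights and keeps only the
# L2-normalized vectors to save memory; after this, further training
# is no longer possible
model.init_sims(replace=True)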

Majid