Replace random word by similarity with word2vec

Question

I would like to replace a random word in a sentence with its most similar word according to word2vec, for example a word from the sentence question = 'Can I specify which GPU to use?'.

I used this recursive method because, after splitting the sentence with split(), some words (like 'to') are not in the word2vec model:

import gensim.models.keyedvectors as word2vec
import random as rd

model = word2vec.KeyedVectors.load_word2vec_format('/Users/nbeau/Desktop/Word2vec/model/GoogleNews-vectors-negative300.bin', binary=True)

def similar_word(sentence, size):
    pos_to_replace = rd.randint(0, size-1)
    try:
        similarity = model.most_similar(positive = [sentence[pos_to_replace]])
        similarity = similarity[0][0]
    except KeyError:
        # word not in the model: retry recursively with another random position
        similarity, pos_to_replace = similar_word(sentence, size)
    return similarity, pos_to_replace

question = question.split()
size = len(question)
similarity, pos_to_replace = similar_word(question, size)
question[pos_to_replace] = similarity

I would like to know if there is a better method to avoid the words which are not in the word2vec model.

gojomo · Accepted Answer · 2019-04-30T18:17:41.650

A few thoughts:

If kv_model is your KeyedVectors model, you can test whether a word is present with `'to' in kv_model`, rather than trying the lookup and catching the KeyError. But being optimistic & catching the error is a common idiom as well!
Your recursion won't necessarily exit: if the supplied text contains no known words, it will keep retrying until Python's recursion limit is hit and a RecursionError is raised. Also, it may try the same word many times.

I'd suggest using a loop rather than recursion, and using Python's random.shuffle() method to create a single random permutation of all potential indexes. Then, try each in turn, returning as soon as a replacement is possible, or indicating failure if no replacement was possible.

Keeping your same method return-signature:

import random

def similar_word(sentence):
    # shuffle() works in place, so it needs a real list, not a range object
    indexes = list(range(len(sentence)))
    random.shuffle(indexes)
    for i in indexes:
        if sentence[i] in kv_model:
            return kv_model.most_similar(sentence[i], topn=1)[0][0], i
    return None, -1  # no replacement was possible

(But separate from your question: if 100% of the time, the result of the function is used to perform a replacement, I'd just move the replacement inside the function, mutating the passed-in sentence. And the function could report how many replacements it made: 0 for failure, 1 for the usual case – and perhaps in the future could accept a parameter to request more than 1 replacement.)
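A rough sketch of that in-place variant, in case it helps. The function name `replace_similar_words`, the `max_replacements` parameter and the `StubVectors` class are my own inventions for illustration. The stub only stands in for a loaded KeyedVectors model so the snippet runs without the GoogleNews file; with a real model you would pass that instead.

```python
import random


def replace_similar_words(sentence, kv_model, max_replacements=1):
    """Replace up to max_replacements known words in place; return the count.

    sentence: list of tokens, mutated in place.
    kv_model: anything supporting `word in kv_model` and
              `most_similar(word, topn=1)`, e.g. a gensim KeyedVectors.
    """
    replaced = 0
    # random.sample gives a new list holding every index once, in random order
    for i in random.sample(range(len(sentence)), len(sentence)):
        if replaced >= max_replacements:
            break
        if sentence[i] in kv_model:
            sentence[i] = kv_model.most_similar(sentence[i], topn=1)[0][0]
            replaced += 1
    return replaced


class StubVectors:
    """Hypothetical stand-in for KeyedVectors (illustration only)."""

    def __init__(self, neighbours):
        self.neighbours = neighbours  # maps word -> its "most similar" word

    def __contains__(self, word):
        return word in self.neighbours

    def most_similar(self, word, topn=1):
        return [(self.neighbours[word], 1.0)][:topn]


tokens = 'Can I specify which GPU to use?'.split()
stub = StubVectors({'specify': 'define', 'GPU': 'GPUs'})
count = replace_similar_words(tokens, stub, max_replacements=2)
print(count, tokens)
```

Each index is visited at most once, so the loop always terminates, and a return value of 0 tells the caller that no word in the sentence was known to the model.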

Thanks a lot for your answer. There is just one problem with the `random.shuffle` method, related to this: [https://stackoverflow.com/questions/17649875/why-does-random-shuffle-return-none]. I have used `random.sample(range(len(sentence)), len(sentence))` instead. — Jonor, Apr 30 '19 at 08:46
`sample()` is a good approach too! (I've also corrected my answer to use `shuffle()` properly to change a list in-place.) — gojomo, Apr 30 '19 at 18:19
