
I would like to take a word such as "book", get its vector representation, call it v_1, and find all words whose vector representation lies within a ball of radius r of v_1, i.e. ||v_1 - v_i|| <= r, for some real number r.

I know gensim has a most_similar function, which lets you specify the number of top vectors to return, but that is not quite what I need. I could certainly use a brute-force search to get the answer, but it would be too slow.

user1700890
    calculate euclidean distance to the point, then filter out records farther than r – Marat Sep 16 '19 at 16:56
  • @Marat, thank you, this looks like brute force to me. I will need to loop through all words in the model (though I still need to find out how to do that). Anything faster? – user1700890 Sep 16 '19 at 17:00
    It isn't that bad. Typically vocab size is not that big and with properly vectorized operations that will take a fraction of a second. There are some optimizations for multiple searches (like R-trees), but for a single search that's your only option – Marat Sep 16 '19 at 18:34
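The vectorized brute-force search the comment describes can be sketched as follows. The function name and the generic vectors/words arguments are my own framing; with gensim 3.x you would pass wv.vectors as the matrix and wv.index2entity as the word list:

```python
import numpy as np

def words_within_radius(vectors, words, target_idx, r):
    """Return all words whose vectors lie within Euclidean
    distance r of vectors[target_idx].

    vectors: (vocab_size, dim) array of word vectors
    words:   list of words, in the same order as the rows
    """
    # One vectorized pass: distances from the target to every row
    dists = np.linalg.norm(vectors - vectors[target_idx], axis=1)
    # Keep only indexes within the radius
    return [words[i] for i in np.where(dists <= r)[0]]

# Tiny illustrative vocabulary (made-up vectors)
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
words = ['book', 'novel', 'carrot']
print(words_within_radius(vectors, words, 0, 1.5))
```

Even for a vocabulary of a few hundred thousand words, this single matrix operation typically runs in well under a second, which is the comment's point: for a one-off query, an index structure is unnecessary.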

1 Answer


If you call most_similar() with topn=0, it will return the raw, unsorted cosine-similarities to all other words known to the model. (These similarities will not be in tuples with the words, but simply in the same order as the words in the index2entity property.)

You could then filter that array for similarities above your preferred threshold, and return just the corresponding indexes/words, using a function like numpy's argwhere.

For example:

import numpy as np

target_word = 'apple'
threshold = 0.9
# topn=0 returns a raw numpy array of cosine similarities,
# in the same order as wv.index2entity
all_sims = wv.most_similar(target_word, topn=0)
# argwhere returns an (n, 1) array; flatten it to plain indexes
satisfactory_indexes = np.argwhere(all_sims > threshold).flatten()
satisfactory_words = [wv.index2entity[i] for i in satisfactory_indexes]
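Note that this filters by cosine similarity, while the question asks for a Euclidean radius. Since most_similar() compares unit-normalized vectors, the two are interconvertible in that normalized space: for unit vectors a and b, ||a - b||^2 = 2 - 2*cos(a, b). A small helper (my own, not part of gensim) to turn a radius r into the equivalent cosine threshold:

```python
def radius_to_cosine_threshold(r):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b),
    so Euclidean distance <= r is equivalent to
    cosine similarity >= 1 - r**2 / 2."""
    return 1.0 - r ** 2 / 2.0

# e.g. a radius of 0 demands identical directions (cosine 1.0)
print(radius_to_cosine_threshold(0.0))
```

This only matches the question's formulation if the radius is meant on the normalized vectors; for raw (unnormalized) vectors, the brute-force distance computation is the direct route.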
gojomo