I have a target NumPy array with shape (300,) and a set of candidate arrays, also of shape (300,). These arrays are Word2Vec representations of words, and I want to find the candidate word that is most similar to the target word based on those vectors. What's the best way to do this?
One way to do this is to sum the absolute values of the element-wise differences between the target vector and each candidate vector, then select the candidate with the lowest total absolute difference. For example:
candidate_1_difference = np.subtract(target_vector, candidate_vector)
candidate_1_abs_difference = np.absolute(candidate_1_difference)
candidate_1_total_difference = np.sum(candidate_1_abs_difference)
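Applied to every candidate at once, the idea above could look something like the following sketch. The helper name `l1_closest` and the small 4-dimensional vectors are made up for illustration; the real vectors would have shape (300,) and the candidates are assumed to be stacked into a single 2-D array.

```python
import numpy as np

def l1_closest(target_vector, candidate_vectors):
    """Return (index of closest candidate, total absolute difference per candidate)."""
    candidate_vectors = np.asarray(candidate_vectors)
    # Broadcasting subtracts the target from every candidate row in one step
    total_differences = np.abs(candidate_vectors - target_vector).sum(axis=1)
    # The smallest total absolute difference marks the closest candidate
    return int(np.argmin(total_differences)), total_differences

# Made-up 4-dimensional vectors standing in for the real 300-dimensional ones
target_vector = np.array([1.0, 2.0, 3.0, 4.0])
candidate_vectors = np.array([
    [0.0, 0.0, 0.0, 0.0],  # total difference: 1 + 2 + 3 + 4 = 10
    [1.0, 2.0, 3.0, 5.0],  # total difference: 0 + 0 + 0 + 1 = 1
    [4.0, 3.0, 2.0, 1.0],  # total difference: 3 + 1 + 1 + 3 = 8
])

best_index, total_differences = l1_closest(target_vector, candidate_vectors)
print(best_index, total_differences)
```

With these made-up values the second candidate (index 1) has the lowest total difference.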
Yet, this seems clunky and potentially wrong. What's a better way to do this?
Edit to include example vectors:
import numpy as np
import gensim

path = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'

def func1(path):
    # Limited to 50K words to reduce load time
    model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True, limit=50000)
    context = ['computer', 'mouse', 'keyboard']
    candidates = ['office', 'house', 'winter']
    vectors_to_sum = []
    for word in context:
        # KeyedVectors can be indexed directly; .wv belongs to the full Word2Vec model
        vectors_to_sum.append(model[word])
    # axis=0 sums element-wise and keeps shape (300,); without it np.sum returns a scalar
    target_vector = np.sum(vectors_to_sum, axis=0)
    # Look up the candidate's vector instead of using the word string itself
    candidate_vector = model[candidates[0]]
    candidate_1_difference = np.subtract(target_vector, candidate_vector)
    candidate_1_abs_difference = np.absolute(candidate_1_difference)
    candidate_1_total_difference = np.sum(candidate_1_abs_difference)
    return candidate_1_total_difference