Comparing NumPy Arrays for Similarity

Question

I have a target NumPy array with shape (300,) and a set of candidate arrays, also of shape (300,). These arrays are Word2Vec representations of words. What's the best way to find the candidate word that is most similar to the target word, using their vector representations?

One way to do this is to sum up the absolute values of the element-wise differences between the target word and the candidate words, then select the candidate word with the lowest overall absolute difference. For example:

candidate_1_difference = np.subtract(target_vector, candidate_vector)
candidate_1_abs_difference = np.absolute(candidate_1_difference)
candidate_1_total_difference = np.sum(candidate_1_abs_difference)

Yet, this seems clunky and potentially wrong. What's a better way to do this?

Edit to include example vectors:

import numpy as np
import gensim

path = 'https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz'


def func1(path):
    #Limited to 50K words to reduce load time
    model = gensim.models.KeyedVectors.load_word2vec_format(path, binary=True, limit=50000)
    context =  ['computer','mouse','keyboard']
    candidates = ['office','house','winter']
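    # Sum the context word vectors element-wise to get one (300,) target vector
    target_vector = np.sum([model[word] for word in context], axis=0)

    # Look up the vector for the first candidate word (not the word string itself)
    candidate_vector = model[candidates[0]]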
    candidate_1_difference = np.subtract(target_vector, candidate_vector)
    candidate_1_abs_difference = np.absolute(candidate_1_difference)
    candidate_1_total_difference = np.sum(candidate_1_abs_difference)
    return candidate_1_total_difference

Could you share some sample arrays? – yatu Jul 05 '19 at 19:34
@yatu: sure, added additional code for context. – Caerus Jul 05 '19 at 19:54

score 2 · Accepted Answer · answered Jul 05 '19 at 19:57

What you have is basically correct. You are calculating the L1 norm of the difference, which is the sum of absolute differences. A more common option is the Euclidean norm, or L2 norm, which is the familiar distance measure: the square root of the sum of squared differences.

You can use numpy.linalg.norm to calculate different norms; by default it calculates the L2 norm for vectors. Pass ord=1 to get the L1 norm instead.

distance = np.linalg.norm(target_vector - candidate_vector)
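As a rough illustration of how the two norms relate, here is a minimal sketch. The three-element vectors are made-up stand-ins for the 300-dimensional Word2Vec vectors, not real embeddings:

```python
import numpy as np

# Hypothetical small vectors standing in for 300-dimensional embeddings
target_vector = np.array([1.0, 2.0, 3.0])
candidate_vector = np.array([2.0, 0.0, 3.0])

diff = target_vector - candidate_vector

# L1 norm: sum of absolute differences (the approach from the question)
l1_manual = np.sum(np.absolute(diff))
l1_norm = np.linalg.norm(diff, ord=1)

# L2 norm: Euclidean distance (the default when ord is omitted)
l2_norm = np.linalg.norm(diff)

print(l1_manual, l1_norm, l2_norm)
```

The manual sum and `ord=1` give the same number, so the three-line version in the question can be collapsed into a single call.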

If you have one target vector and multiple candidate vectors stored in a list, the same call works, but you need to specify the axis for norm. You then get a vector of norms, one for each candidate vector.

For a list of candidate vectors:

distance = np.linalg.norm(target_vector - np.array(candidate_vector), axis=1)
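To go from the vector of distances to the most similar word, one option is np.argmin. The sketch below assumes the candidate words and their vectors are kept in matching order; the three-element vectors and the name `candidate_vectors` are illustrative, not taken from the real model:

```python
import numpy as np

candidates = ['office', 'house', 'winter']

# Hypothetical small vectors standing in for 300-dimensional embeddings
target_vector = np.array([1.0, 2.0, 3.0])
candidate_vectors = [
    np.array([1.0, 2.0, 4.0]),
    np.array([5.0, 5.0, 5.0]),
    np.array([-1.0, 0.0, 2.0]),
]

# Broadcasting subtracts the target from every row; axis=1 gives one norm per row
distances = np.linalg.norm(target_vector - np.array(candidate_vectors), axis=1)

# The smallest distance marks the most similar candidate
best_index = int(np.argmin(distances))
best_candidate = candidates[best_index]

print(best_candidate, distances)
```

With these made-up values the first candidate has the smallest distance, so `best_candidate` is 'office'.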
