How to speed up computation of cosine similarity between set of vectors

Question

I have a set of about 30k vectors, each with 300 elements generated by fastText. Each vector represents the meaning of an entity. I want to calculate the similarity between all pairs of entities, so I iterate over the vectors in a nested loop. This has O(N^2) complexity, which is not practical in terms of time.

Can you recommend another approach for calculating this, or a way to parallelize it?

import numpy as np


def calculate_similarity(v1, v2):
    """
    Calculate cosine similarity between two vectors
    """
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    return np.dot(v1, v2) / n1 / n2


similarities = {}
for ith_entity, ith_vector in vectors.items():
    for jth_entity, jth_vector in vectors.items():
        if ith_entity == jth_entity:
            continue
        if (ith_entity, jth_entity) in similarities.keys() or (jth_entity, ith_entity) in similarities.keys():
            continue
        similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)

Not sure but would a cluster algorithm such as [KMeans](https://docs.scipy.org/doc/scipy/reference/cluster.vq.html) help here? — Bill, Feb 17 '18 at 06:01
so you want to compute n^2 numbers but not have it run in n^2 time? Or you just want to make the factor smaller? — Mad Physicist, Feb 17 '18 at 06:01
@Bill that's a valid idea, but not the direct solution to what I am looking for. — IbrahimSharaf, Feb 17 '18 at 06:04
Simple speedup ideas: 1) Use symmetry, since the result for (i, j) is the same as for (j, i): run i over all items but j only from the first to the i-th item. 2) Normalize all items in advance to avoid normalizing inside the loop. — Aguy, Feb 17 '18 at 06:04
@MadPhysicist I want either to lower the complexity to O(n*log n), or parallelize the O(n^2) calculation. — IbrahimSharaf, Feb 17 '18 at 06:06
Without additional information about the vectors, you are trying to compute n * (n - 1) / 2 values. That's non negotiably O(n^2) — Mad Physicist, Feb 17 '18 at 06:09
This question is too broad for SO. Start with your own research, try something, and ask another question if you run into a specific issue at that point. — Mad Physicist, Feb 17 '18 at 06:20
Apparently [sklearn](http://scikit-learn.org/stable/) has a [cosine_similarity function](https://stackoverflow.com/a/27046041/1609514). — Bill, Feb 17 '18 at 19:17
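
The two speedups suggested in the comments above (normalize once, exploit symmetry) can be taken one step further by replacing the Python loops with a single matrix product. The following is a minimal sketch, not the asker's code: the helper name `cosine_similarity_matrix` and the random `vectors` dict are hypothetical stand-ins, and it assumes no vector is all zeros.

```python
import numpy as np


def cosine_similarity_matrix(data):
    """Return the full (m, m) cosine similarity matrix for an (m, n) array."""
    data = np.asarray(data, dtype=float)
    # Normalize every row once, instead of inside the pairwise loop.
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    unit = data / norms  # assumes no all-zero rows
    # The dot product of two unit vectors is their cosine similarity.
    return unit @ unit.T


# Hypothetical stand-in for the question's dict of entity vectors.
rng = np.random.default_rng(0)
vectors = {f"entity_{i}": rng.standard_normal(300) for i in range(5)}

keys = list(vectors)
sim = cosine_similarity_matrix(np.array([vectors[k] for k in keys]))

# Use symmetry: keep only the pairs with i < j (upper triangle, no diagonal).
i_idx, j_idx = np.triu_indices(len(keys), k=1)
similarities = {(keys[i], keys[j]): sim[i, j] for i, j in zip(i_idx, j_idx)}
```

The work is still O(n^2), as noted in the comments, but it runs inside NumPy's matrix multiplication rather than the interpreter. For ~30k vectors the full matrix has 9 * 10^8 entries, so memory becomes the main constraint.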

score 3 · Answer 1 · answered Feb 17 '18 at 09:37

You could get rid of the nested loop, which is slow, by using scipy's distance module.

Given vectors = {'k1': v1, 'k2': v2, ..., 'km': vm}, with each vi being a Python list of length n:

import numpy as np
from scipy.spatial import distance

# transform vectors to an m x n numpy array
data = np.array(list(vectors.values()))

# compute pairwise cosine distances
pws = distance.pdist(data, metric='cosine')

pws is a condensed distance matrix. It is one-dimensional and holds the distances in the following order:

pws = np.array([ (k1, k2), (k1, k3), (k1, k4), ..., (k1, km),
                           (k2, k3), (k2, k4), ..., (k2, km),
                                      ...,
                                                   (km-1, km) ])

Note also that distance.pdist calculates the cosine distance rather than the cosine similarity. The similarity is 1 minus the distance.
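
To get back to the dictionary of similarities keyed by entity pairs that the question builds, the condensed result can be paired with the keys in the same order. This is a small sketch under assumptions: the `vectors` dict here is random illustrative data, and the key order is taken from the dict's insertion order.

```python
from itertools import combinations

import numpy as np
from scipy.spatial import distance

# Illustrative data only: four random 300-dimensional vectors.
rng = np.random.default_rng(0)
vectors = {k: rng.standard_normal(300) for k in ("k1", "k2", "k3", "k4")}

keys = list(vectors)
data = np.array([vectors[k] for k in keys])

# Condensed cosine distances, ordered (k1, k2), (k1, k3), ..., (k3, k4).
pws = distance.pdist(data, metric="cosine")

# Cosine similarity is 1 minus cosine distance.
sims = 1.0 - pws

# combinations() yields key pairs in the same order as the condensed array.
similarities = dict(zip(combinations(keys, 2), sims))

# Optional: expand to a square m x m distance matrix (zeros on the diagonal).
full = distance.squareform(pws)
```

`squareform` is convenient for lookups by index, at the cost of storing every pair twice.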

Bill · Answer 2 · 2018-02-17T10:19:12.670

I had a go at vectorizing it.

import numpy as np
from itertools import combinations

np.random.seed(1)

vector_data = np.random.randn(3, 3)

v1, v2, v3 = vector_data[0], vector_data[1], vector_data[2]

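def similarities_vectorized(vector_data):
    # Norm of every row, computed once
    norms = np.linalg.norm(vector_data, axis=1)
    # All index pairs (i, j) with i < j; list() is needed because recent
    # NumPy versions do not accept a bare iterator here
    combs = np.array(list(combinations(range(vector_data.shape[0]), 2)))
    # Row-wise dot products divided by the two norms
    similarities = (vector_data[combs[:,0]]*vector_data[combs[:,1]]).sum(axis=1)/norms[combs[:,0]]/norms[combs[:,1]]
    return combs, similarities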

combs, similarities = similarities_vectorized(vector_data)

for comb, similarity in zip(combs, similarities):
    print(comb, similarity)

Output:

[0 1] -0.217095007411
[0 2] 0.894174618451
[1 2] -0.630555641519

Compare result with code from Question:

def calculate_similarity(v1, v2):
    """
    Calculate cosine similarity between two vectors
    """
    n1 = np.linalg.norm(v1)
    n2 = np.linalg.norm(v2)
    return np.dot(v1, v2) / n1 / n2

def calculate_simularities(vectors):
    similarities = {}
    for ith_entity, ith_vector in vectors.items():
        for jth_entity, jth_vector in vectors.items():
            if ith_entity == jth_entity:
                continue
            if (ith_entity, jth_entity) in similarities.keys() or (jth_entity, ith_entity) in similarities.keys():
                continue
            similarities[(ith_entity, jth_entity)] = calculate_similarity(ith_vector, jth_vector)
    return similarities

vectors = {'A': v1, 'B': v2, 'C': v3}

print(calculate_simularities(vectors))

Output:

{('A', 'B'): -0.21709500741113338, ('A', 'C'): 0.89417461845058566, ('B', 'C'): -0.63055564151883581}

The vectorized version was about 3.3 times faster when I ran it on a set of 300 vectors.

UPDATE:

This version is about 50 times faster than the original:

def similarities_vectorized2(vector_data):
    norms = np.linalg.norm(vector_data, axis=1)
    combs = np.fromiter(combinations(range(vector_data.shape[0]),2), dtype='i,i')
    similarities = (vector_data[combs['f0']]*vector_data[combs['f1']]).sum(axis=1)/norms[combs['f0']]/norms[combs['f1']]
    return combs, similarities

combs, similarities = similarities_vectorized2(vector_data)

for comb, similarity in zip(combs, similarities):
    print(comb, similarity)

Output:

(0, 1) -0.217095007411
(0, 2) 0.894174618451
(1, 2) -0.630555641519
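
At the question's scale (~30k vectors) there are about 450 million pairs, so gathering both operands for every pair at once, as the functions above do, needs far more memory than a typical machine has. One possible workaround, sketched here as an alternative technique rather than as part of this answer, is to normalize once and process the similarity matrix in row blocks. The helper name `iter_similarity_blocks`, the block size, and the random data are all assumptions, and the "most similar other entity" step is only an example of consuming each block.

```python
import numpy as np


def iter_similarity_blocks(data, block_size=1000):
    """Yield (start, stop, block), where block[i, j] is the cosine
    similarity between row start + i and row j of data."""
    data = np.asarray(data, dtype=np.float32)
    unit = data / np.linalg.norm(data, axis=1, keepdims=True)  # assumes no zero rows
    for start in range(0, unit.shape[0], block_size):
        stop = min(start + block_size, unit.shape[0])
        # One matrix product per block of rows.
        yield start, stop, unit[start:stop] @ unit.T


# Illustrative data only.
rng = np.random.default_rng(0)
data = rng.standard_normal((2500, 300))

# Example use: index of the most similar *other* row for every row.
best = np.empty(len(data), dtype=np.int64)
for start, stop, block in iter_similarity_blocks(data, block_size=1000):
    rows = np.arange(stop - start)
    block[rows, start + rows] = -np.inf  # mask each row's similarity to itself
    best[start:stop] = block.argmax(axis=1)
```

Only one block of shape (block_size, m) is in memory at a time. The matrix products are usually dispatched to a BLAS library, which is often multithreaded, so this may also cover part of the parallelization the question asks about.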

score 0 · Answer 3 · answered Oct 26 '21 at 08:12

Use a ball tree. I have used it on a very large feature array with shape (16460, 4096). First, construct the tree using the snippet below:

from scipy import spatial
from sklearn.neighbors import BallTree

# features_tsvd is the (n_samples, n_features) array of feature vectors
tree = BallTree(features_tsvd, metric=spatial.distance.cosine)

Now, to search for a query in the tree, try something like this:

dists, ind = tree.query(query, k=10)
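
A variant that avoids calling a Python function for every distance is to normalize the vectors and use the tree's default Euclidean metric. For unit vectors a and b, ||a - b||^2 = 2 - 2 * cos(a, b), so Euclidean nearest neighbours are also cosine nearest neighbours. The sketch below uses random illustrative data; the names `features`, `unit`, and `cosine_sims` are assumptions, not part of the answer above.

```python
import numpy as np
from sklearn.neighbors import BallTree

# Illustrative data only: 1000 random 50-dimensional feature vectors.
rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 50))

# Normalize rows so Euclidean distance becomes a function of cosine similarity.
unit = features / np.linalg.norm(features, axis=1, keepdims=True)

# Default metric is Minkowski with p=2, i.e. Euclidean.
tree = BallTree(unit)

# Query the 10 nearest neighbours of the first three vectors.
query = unit[:3]
dists, ind = tree.query(query, k=10)

# Convert Euclidean distances between unit vectors to cosine similarities.
cosine_sims = 1.0 - dists ** 2 / 2.0
```

Because the query points are taken from the indexed data, each one is returned as its own nearest neighbour with similarity 1. How much a tree helps depends on the data; with hundreds or thousands of dimensions it may not beat a brute-force search.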
