What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

The sparse representation is:

A = 
0, 1
0, 4
1, 2
1, 3
1, 4
2, 0
2, 1
2, 3

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array(
[[0, 1, 0, 0, 1],
[0, 0, 1, 1, 1],
[1, 1, 0, 1, 0]])

dist_out = 1-pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished?

Seth yes, I edited it with your correction. Thanks. Size is currently in the tens of thousand non-zero entries, but I would like to handle 2-3 orders of magnitude greater. — zbinsd, Jul 13 '13 at 07:16

score 95 · Accepted Answer · answered Aug 23 '16 at 14:45

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A =  np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

#also can output sparse matrices
similarities_sparse = cosine_similarity(A_sparse,dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose()

For me I get dense output no matter what I put in for dense_output argument. Any idea why that might be? This is version 0.19. — Ben, Jul 19 '18 at 15:48
Are both of your input arrays sparse? From the docs: "If False, the output is sparse if both input arrays are sparse." — Jeff, Jul 21 '18 at 03:11
@Jeff how can I calculate pair wise cosine similarity for a very large dataset, any idea? The memory crashes everytime I run the above code. — yamini goel, May 30 '19 at 11:48
Takes 1.5 sec to calculate similarity for matrix (163,000 x 2500) can we make it faster than this? — Muaaz Khalid, Oct 03 '21 at 08:23

score 53 · Answer 2 · edited Feb 17 '22 at 08:34

The following method is about 30 times faster than scipy.spatial.distance.pdist. It works pretty quickly on large matrices (assuming you have enough RAM)

See below for a discussion of how to optimize for sparsity.

import numpy as np

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for sparse representation
similarity = np.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = np.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
inv_square_mag[np.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = np.sqrt(inv_square_mag)
    
# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.

WARNING: toarray() OR todense() methods will convert your sparse matrix to normal 1d or 2d matrix with more than 90% values zeros. So, you'll run our of mamory just for 50K documents, 350,000 Vocabulary words — Muaaz Khalid, Oct 03 '21 at 08:21

score 16 · Answer 3 · answered Sep 30 '17 at 09:31

I have tried some methods above. However, the experiment by @zbinsd has its limitation. The sparsity of matrix used in the experiment is extremely low while the real sparsity is usually over 90%. In my condition, the sparse is with the shape of (7000, 25000) and the sparsity of 97%. The method 4 is extremely slow and I can't tolerant getting the results. I use the method 6 which is finished in 10 s. Amazingly, I try the method below and it's finished in only 0.247 s.

import sklearn.preprocessing as pp

def cosine_similarities(mat):
    col_normed_mat = pp.normalize(mat.tocsc(), axis=0)
    return col_normed_mat.T * col_normed_mat

This efficient method is linked by enter link description here

zbinsd · Answer 4 · 2017-05-15T14:38:28.897

I took all these answers and wrote a script to 1. validate each of the results (see assertion below) and 2. see which is the fastest. Code and results are below:

# Imports
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import squareform, pdist
from sklearn.metrics.pairwise import linear_kernel
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Create an adjacency matrix
np.random.seed(42)
A = np.random.randint(0, 2, (10000, 100)).astype(float).T

# Make it sparse
rows, cols = np.where(A)
data = np.ones(len(rows))
Asp = sp.csr_matrix((data, (rows, cols)), shape = (rows.max()+1, cols.max()+1))

print "Input data shape:", Asp.shape

# Define a function to calculate the cosine similarities a few different ways
def calc_sim(A, method=1):
    if method == 1:
        return 1 - squareform(pdist(A, metric='cosine'))
    if method == 2:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return np.dot(Anorm, Anorm.T)
    if method == 3:
        Anorm = A / np.linalg.norm(A, axis=-1)[:, np.newaxis]
        return linear_kernel(Anorm)
    if method == 4:
        similarity = np.dot(A, A.T)

        # squared magnitude of preference vectors (number of occurrences)
        square_mag = np.diag(similarity)

        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag)

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = similarity * inv_mag
        return cosine.T * inv_mag
    if method == 5:
        '''
        Just a version of method 4 that takes in sparse arrays
        '''
        similarity = A*A.T
        square_mag = np.array(A.sum(axis=1))
        # inverse squared magnitude
        inv_square_mag = 1 / square_mag

        # if it doesn't occur, set it's inverse magnitude to zero (instead of inf)
        inv_square_mag[np.isinf(inv_square_mag)] = 0

        # inverse of the magnitude
        inv_mag = np.sqrt(inv_square_mag).T

        # cosine similarity (elementwise multiply by inverse magnitudes)
        cosine = np.array(similarity.multiply(inv_mag))
        return cosine * inv_mag.T
    if method == 6:
        return cosine_similarity(A)

# Assert that all results are consistent with the first model ("truth")
for m in range(1, 7):
    if m in [5]: # The sparse case
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(Asp, method=m))
    else:
        np.testing.assert_allclose(calc_sim(A, method=1), calc_sim(A, method=m))

# Time them:
print "Method 1"
%timeit calc_sim(A, method=1)
print "Method 2"
%timeit calc_sim(A, method=2)
print "Method 3"
%timeit calc_sim(A, method=3)
print "Method 4"
%timeit calc_sim(A, method=4)
print "Method 5"
%timeit calc_sim(Asp, method=5)
print "Method 6"
%timeit calc_sim(A, method=6)

Results:

Input data shape: (100, 10000)
Method 1
10 loops, best of 3: 71.3 ms per loop
Method 2
100 loops, best of 3: 8.2 ms per loop
Method 3
100 loops, best of 3: 8.6 ms per loop
Method 4
100 loops, best of 3: 2.54 ms per loop
Method 5
10 loops, best of 3: 73.7 ms per loop
Method 6
10 loops, best of 3: 77.3 ms per loop

score 4 · Answer 5 · answered Jun 13 '16 at 00:30

4

Hi you can do it this way

    temp = sp.coo_matrix((data, (row, col)), shape=(3, 59))
    temp1 = temp.tocsr()

    #Cosine similarity
    row_sums = ((temp1.multiply(temp1)).sum(axis=1))
    rows_sums_sqrt = np.array(np.sqrt(row_sums))[:,0]
    row_indices, col_indices = temp1.nonzero()
    temp1.data /= rows_sums_sqrt[row_indices]
    temp2 = temp1.transpose()
    temp3 = temp1*temp2

answered Jun 13 '16 at 00:30

Vaali

151
2
11

+1, this works great. I rewrote this as a function in another answer and tested it against some other algorithms. – David Sep 14 '18 at 17:04

David · Answer 6 · 2018-09-14T17:02:24.203

Building off of Vaali's solution:

def sparse_cosine_similarity(sparse_matrix):
    out = (sparse_matrix.copy() if type(sparse_matrix) is csr_matrix else
           sparse_matrix.tocsr())
    squared = out.multiply(out)
    sqrt_sum_squared_rows = np.array(np.sqrt(squared.sum(axis=1)))[:, 0]
    row_indices, col_indices = out.nonzero()
    out.data /= sqrt_sum_squared_rows[row_indices]
    return out.dot(out.T)

This takes a sparse matrix (preferably a csr_matrix) and returns a csr_matrix. It should do the more intensive parts using sparse calculations with pretty minimal memory overhead. ~~I haven't tested it extensively though, so caveat emptor~~ (Update: I feel confident in this solution now that I've tested and benchmarked it)

Also, here is the sparse version of Waylon's solution in case it helps anyone, not sure which solution is actually better.

def sparse_cosine_similarity_b(sparse_matrix):
    input_csr_matrix = sparse_matrix.tocsr()
    similarity = input_csr_matrix * input_csr_matrix.T
    square_mag = similarity.diagonal()
    inv_square_mag = 1 / square_mag
    inv_square_mag[np.isinf(inv_square_mag)] = 0
    inv_mag = np.sqrt(inv_square_mag)
    return similarity.multiply(inv_mag).T.multiply(inv_mag)

Both solutions seem to have parity with sklearn.metrics.pairwise.cosine_similarity

:-D

Update:

Now I have tested both solutions against my existing Cython implementation: https://github.com/davidmashburn/sparse_dot/blob/master/test/benchmarks_v3_output_table.txt and it looks like the first algorithm performs the best of the three most of the time.

score 1 · Answer 7 · edited Jan 26 '20 at 19:32

1

You should check out scipy.sparse. You can apply operations on those sparse matrices just like how you use a normal matrix.

edited Jan 26 '20 at 19:32

nucsit026

652
7
16

answered Jul 13 '13 at 05:41

cheeyos

671
4
11

3

`scipy.sparse` don't support that kind of operation. – Igor Medeiros Oct 12 '13 at 15:43

score 0 · Answer 8 · answered Mar 04 '19 at 21:59

def norm(vector):
    return sqrt(sum(x * x for x in vector))    

def cosine_similarity(vec_a, vec_b):
        norm_a = norm(vec_a)
        norm_b = norm(vec_b)
        dot = sum(a * b for a, b in zip(vec_a, vec_b))
        return dot / (norm_a * norm_b)

This method seems to be somewhat faster than using sklearn's implementation if you pass in one pair of vectors at a time.

moshik · Answer 9 · 2019-03-11T18:17:33.693

0

I suggest to run in two steps:

1) generate mapping A that maps A:column index->non zero objects

2) for each object i (row) with non-zero occurrences(columns) {k1,..kn} calculate cosine similarity just for elements in the union set A[k1] U A[k2] U.. A[kn]

Assuming a big sparse matrix with high sparsity this will gain a significant boost over brute force

edited Mar 11 '19 at 18:17

answered Mar 11 '19 at 13:40

moshik

1,424
1
10
6

Arthur Kim · Answer 10 · 2022-11-02T03:58:53.830

@jeff 's solution is changed

As version of scikit-learn 1.1.2, you don't need to use scipy's sparse before cosine_similarity.

All you need is cosine_similarity

from typing import Tuple

import numpy as np
import perfplot
import scipy
from sklearn.metrics.pairwise import cosine_similarity as cosine_similarity_sklearn_internal
from scipy import spatial
from scipy import sparse
import sklearn.preprocessing as pp

target_dtype = "float16"


class prettyfloat(float):
    def __repr__(self):
        return "%.2f" % self


def cosine_similarity_sklearn(x):
    return cosine_similarity_sklearn_internal(x)


def cosine_similarity_sklearn_sparse(x):
    x_sparse = sparse.csr_matrix(x)

    return cosine_similarity_sklearn_internal(x_sparse)


def cosine_similarity_einsum(x, y=None):
    """
    Calculate the cosine similarity between two vectors.
    if x == y, only use x
    """
    # cosine_similarity in einsum notation without astype
    normed_x = x / np.linalg.norm(x, axis=1)[:, None]
    normed_y = y / np.linalg.norm(y, axis=1)[:, None] if y else normed_x
    return np.einsum("ik,jk->ij", normed_x, normed_y)


def cosine_similarity_scipy(x, y=None):
    """
    Calculate the cosine similarity between two vectors.
    if x == y, only use x
    """
    return 1 - spatial.distance.cosine(x, x)


def setup_n(n) -> Tuple[np.ndarray, np.ndarray]:
    nd_arr = np.random.randn(int(2 ** n), 512).astype(target_dtype)
    return nd_arr


def equality_check(a, b):
    if type(a) != np.ndarray:
        a = a.todense()
    if type(b) != np.ndarray:
        b = b.todense()

    return np.isclose(a.astype(target_dtype), b.astype(target_dtype), atol=1e-3).all()


fig = perfplot.show(
    setup=setup_n,
    n_range=[k for k in range(1, 10)],
    kernels=[
        cosine_similarity_sklearn,
        cosine_similarity_sklearn_sparse,
        cosine_similarity_einsum,
        # cosine_similarity_scipy,
    ],
    labels=["sk-def", "sk+sparse", "einsum"],
    logx=False,
    logy=False,
    xlabel='2^n',
    equality_check=equality_check,
)

Using perfplot, it show, `from typing import Tuple

import numpy as np import perfplot import scipy from sklearn.metrics.pairwise import cosine_similarity` is the best.

in scikit-learn==1.1.2,1.1.3

It can be different result in float64 and float16.

For float64,
For float16,

What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

10 Answers10

Linked