0

How can optimze below code. take example of

x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]

The number of row are very large((4 million) above is just an example.Basically i have to calculate cosine similarity for each row of x_tsvd with svd_tfidf. Is there any way i can optimize it further to speed up.

for i in range(len(x_tsvd)):
    array_=[]
    for j in range (len(svd_tfidf)):
        cosine_similarity_=np.dot(x_tsvd[i],svd_tfidf[j])/(norm(x_tsvd[i])*norm(svd_tfidf[j]))
        array_.append(cosine_similarity_)
    index=np.array(array_).argsort()
sudojarvis
  • 37
  • 5
  • "Basically i have to calculate cosine similarity for each row of x_tsvd with svd_tfidf." To be clear: you wish to calculate 16 trillion results and store them all? How much memory do you expect this to require? – Karl Knechtel Jul 13 '22 at 04:16
  • How are you going to store the result? Technically you will have an index (4 million vector) for each `i` (1 to 4 million), usually this won't fit the memory of a typical PC. Are you going to store them as a file? – Kota Mori Jul 13 '22 at 04:17
  • yes i will save it – sudojarvis Jul 13 '22 at 04:24
  • Do you really need a 4Mx4M matrix? Or, for each element in x_tsvd are you trying to find the N nearest neighbors (say 10) in svd_tfidf for each element in x_tsvd? In the later case, you have a 4M x N (say 4M x 10) size data structure which is much more manageable. See [Using cosine distance with scikit learn KNeighborsClassifier](https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier) – DarrylG Jul 13 '22 at 05:12
  • I want or part that yo have metioned – sudojarvis Jul 13 '22 at 06:45

2 Answers2

1

Mainly through the following points to accelerate:

  1. numpy.linalg.norm function can calculate the norm along the axis. For 2D arrays, specify axis as 1 to calculate the Euclidean norm for each row.
  2. By broadcasting, the elements between two norm vectors can be multiplied in pairs.
  3. numpy.ndarray.dot method can be used between 2D arrays to calculate inner product in pairs.
  4. numpy.ndarray.argsort method can sort along the axis.
>>> x_tsvd = [[1, 2, 3, 4], [5, 4, 5, 6], [2, 4, 5, 5]]
>>> svd_tfidf = [[11, 3, 4, 5], [4, 4, 6, 7], [6, 6, 3, 5]]
>>> x = np.array(x_tsvd)
>>> y = np.array(svd_tfidf)
>>> norm_prod = np.linalg.norm(x, axis=1)[:, None] * np.linalg.norm(y, axis=1)
>>> similarities = x.dot(y.T) / norm_prod
>>> similarities.argsort(axis=1)
array([[0, 2, 1],
       [0, 2, 1],
       [0, 2, 1]], dtype=int64)
Mechanic Pig
  • 6,756
  • 3
  • 10
  • 31
0

Rather than creating 4Mx4M similarity matrix we can create a neighborhood similarity matrix, where:

  • K is the k-nearest neighbor similarity of each observations in x_tsvd to svd_tfid
  • Resulting similarity is 4MxK

Code

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import BallTree

def nearest_neighbors(x, y, k = 3):
    '''
        For each element of x, finds its k-nearest neighbors in y
        
        Technique:
            Neighbors are based upon cosine similarity
            Converts cosine similarity to a Eucliean distance
            based upon:
                https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier/34145444#34145444
            
        
        Returns:
            - Distances to k nearest neighbors
            - Indices to to k nearest neighbors
    '''
    # Normalize each sample of input
    x = normalize(x_tsvd, axis = 1)  
    y = normalize(svd_tfidf, axis = 1) 
    
    # Form the BallTree
    tree = BallTree(x, metric='euclidean')

    # Get distances and indices to k nearest neighbors
    distances, indices = tree.query(y, k = k)
    
    # map distnaces to similarity
    # transform distances back to similarity i.e. similariy = (2 - d**2)/2
    return (2 - distances*distances)/2, indices

Test

x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]
similarity, indices = nearest_neighbors(x_tsvd, svd_tfidf, k = 2)

print(f'Similarity Matrix\n {similarity}\n')
print(f'Index of Matches of x_tsvd in svd_tfidf\n {indices}')

Output

Similarity Matrix
 [[0.88590616 0.72207119]
 [0.98862307 0.98344042]
 [0.95209915 0.88229057]]

Index of Matches of x_tsvd in svd_tfidf
 [[1 2]
 [1 2]
 [1 2]]

Explanation

We use Balltree to find K nearest neighbor. However:

  • Similarity is not a distance metric
  • We use the technique from Using cosine distance with scikit learn KNeighborsClassifier to transform similarity to distances
  • Create Balltree using Euclidean distance on normalized data
  • Euclidean distance of each x, y vector using normalized data will be sqrt(2 − 2*x^T y)
  • With x, y normalized, x^T y is the similarity. Thus, we have transformed similarity to a distance metric
  • Query Balltree to obtain distances and indices of K nearest neighbors x_tsvd observations in svd_tfidf
  • Convert distances back to similarity using inverse transformation
DarrylG
  • 16,732
  • 2
  • 17
  • 23