Rather than creating 4Mx4M similarity matrix we can create a neighborhood similarity matrix, where:
- K is the k-nearest neighbor similarity of each observations in x_tsvd to svd_tfid
- Resulting similarity is 4MxK
Code
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import BallTree
def nearest_neighbors(x, y, k = 3):
'''
For each element of x, finds its k-nearest neighbors in y
Technique:
Neighbors are based upon cosine similarity
Converts cosine similarity to a Eucliean distance
based upon:
https://stackoverflow.com/questions/34144632/using-cosine-distance-with-scikit-learn-kneighborsclassifier/34145444#34145444
Returns:
- Distances to k nearest neighbors
- Indices to to k nearest neighbors
'''
# Normalize each sample of input
x = normalize(x_tsvd, axis = 1)
y = normalize(svd_tfidf, axis = 1)
# Form the BallTree
tree = BallTree(x, metric='euclidean')
# Get distances and indices to k nearest neighbors
distances, indices = tree.query(y, k = k)
# map distnaces to similarity
# transform distances back to similarity i.e. similariy = (2 - d**2)/2
return (2 - distances*distances)/2, indices
Test
x_tsvd = [[1,2,3,4],[5,4,5,6],[2,4,5,5]]
svd_tfidf=[[11,3,4,5],[4,4,6,7],[6,6,3,5]]
similarity, indices = nearest_neighbors(x_tsvd, svd_tfidf, k = 2)
print(f'Similarity Matrix\n {similarity}\n')
print(f'Index of Matches of x_tsvd in svd_tfidf\n {indices}')
Output
Similarity Matrix
[[0.88590616 0.72207119]
[0.98862307 0.98344042]
[0.95209915 0.88229057]]
Index of Matches of x_tsvd in svd_tfidf
[[1 2]
[1 2]
[1 2]]
Explanation
We use Balltree to find K nearest neighbor. However:
- Similarity is not a distance metric
- We use the technique from Using cosine distance with scikit learn KNeighborsClassifier to transform similarity to distances
- Create Balltree using Euclidean distance on normalized data
- Euclidean distance of each x, y vector using normalized data will be sqrt(2 − 2*x^T y)
- With x, y normalized, x^T y is the similarity. Thus, we have transformed similarity to a distance metric
- Query Balltree to obtain distances and indices of K nearest neighbors x_tsvd observations in svd_tfidf
- Convert distances back to similarity using inverse transformation