Given a sparse matrix (created using scipy.sparse.csr_matrix) of size NxN (N = 900,000), I'm trying to find, for every row in testset, the top k nearest neighbors (sparse row vectors from the input matrix) using a custom distance metric. Each row of the input matrix represents an item, and for each item (row) in testset I need to find its k nearest neighbors.
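To make the task concrete, here is a minimal brute-force sketch of the computation, processed in chunks of query rows. The names brute_force_knn, custom_distance and chunk_size are hypothetical and chosen only for illustration, and the metric shown (1 minus the dot product) is a placeholder assumption, not the actual custom metric. It also assumes the metric can be written as a vectorized sparse operation, which may not hold for every custom distance.

```python
import numpy as np
from scipy import sparse


def custom_distance(queries, items):
    """Hypothetical placeholder metric: 1 - dot product of sparse rows.

    Returns a dense array of shape (n_queries, n_items).
    """
    return 1.0 - (queries @ items.T).toarray()


def brute_force_knn(items, queries, k, distance=custom_distance, chunk_size=256):
    """Return (indices, distances) of the k nearest item rows per query row."""
    n_queries = queries.shape[0]
    all_idx = np.empty((n_queries, k), dtype=np.int64)
    all_dist = np.empty((n_queries, k), dtype=np.float64)
    for start in range(0, n_queries, chunk_size):
        stop = min(start + chunk_size, n_queries)
        # Dense block of distances for this chunk only: (chunk, n_items)
        d = distance(queries[start:stop], items)
        # argpartition selects the k smallest per row without a full sort
        part = np.argpartition(d, k - 1, axis=1)[:, :k]
        part_d = np.take_along_axis(d, part, axis=1)
        # Sort only the k selected candidates by distance
        order = np.argsort(part_d, axis=1)
        all_idx[start:stop] = np.take_along_axis(part, order, axis=1)
        all_dist[start:stop] = np.take_along_axis(part_d, order, axis=1)
    return all_idx, all_dist


if __name__ == "__main__":
    items = sparse.random(1000, 1000, density=0.01, format="csr", random_state=0)
    testset = items[:10]
    idx, dist = brute_force_knn(items, testset, k=5)
    print(idx.shape, dist.shape)
```

With N = 900,000 each chunk materializes a dense chunk_size x N block of float64 values (roughly 1.8 GB for chunk_size=256), so the chunk size has to be tuned to the available RAM. This is exact but scales linearly with N per query, which is why I'm looking for an index-based alternative.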
Attempts:
Tried using sklearn.neighbors.NearestNeighbors. However, it appears that sklearn doesn't accept a callable metric function as input when dealing with sparse matrices: ValueError: metric '<function <lambda> at 0x7f92ce221938>' not valid for sparse input
Currently trying to use facebookresearch/pysparnn (looks really promising!). It has a provision for implementing one's own custom distance class. However, it's taking very long to build the index (still running after 24 hrs) and, as mentioned by the author, it seems that using distance types from scipy.spatial.distance.cdist (or sklearn distance metrics) is much slower than what is currently in pysparnn. We're in the process of debugging this performance issue of sklearn/scipy distance metrics by writing something custom.
I'd like to know if there is any other efficient implementation of nearest neighbor search for sparse matrices that supports a custom distance metric.
(Will be executed on a server with 64 GB RAM, 12 cores)
Thanks!