5

Given a sparse matrix (created using scipy.sparse.csr_matrix) of size NxN (N = 900,000), I'm trying to find, for every row in testset, top k nearest neighbors (sparse row vectors from the input matrix) using a custom distance metric. Basically, each row of the input matrix represents an item and for each item (row) in testset, I need to find it's knn.

Attempts:

  • Tried using sklearn.neighbors.NearestNeighbor. However, it appears that sklearn doesn't take callable metric function as input when dealing with sparse matrices:

    ValueError: metric '<function <lambda> at 0x7f92ce221938>' not valid for sparse input
    
  • Currently trying to use facebookresearch/pysparnn (looks really promising!). It has a certain provision for implementing one's own custom distance class. However, after execution, it's taking quite long to build the index (still running after 24 hrs) and as mentioned by the author, it seems that

    using distance types from scipy.spatial.distance.cdist (or sklearn distance metrics) is much slower than what is currently in pysparnn.

    We're in process of debugging this performance issue of sklearn/scipy distance metrics by writing something custom.

I'd like to know if there is any other efficient implementation of nearest neighbor search for sparse matrices which provides for using a custom distance metric?
(Will be executed on a server with 64 GB RAM, 12 cores)

Thanks!

Tejas Shah
  • 638
  • 6
  • 17
  • What does an NxN matrix represent? It might help if you demonstrated this with, say a 10x10 dense array. – hpaulj Mar 23 '17 at 23:51
  • @hpaulj question's metadata re-framed once again. Let me know if any additional information is required and if demonstration using a dense array is still needed for clarification. – Tejas Shah Mar 24 '17 at 01:11
  • 1
    Do you need exact nearest neighbours or would an approximation do? – ali_m Mar 24 '17 at 01:18
  • 1
    @ali_m exact nearest neighbors would be preferred since the accuracy might get affected at later stages. However, I'm open for alternatives using approximations too. – Tejas Shah Mar 24 '17 at 01:49
  • 1
    I just talked about comparing sparse rows here - http://stackoverflow.com/q/42905909/901925 – hpaulj Mar 24 '17 at 03:16
  • When you compare two sparse rows, will you use the intersection of nonzero values, or their union? In other words is the comparison more like multiplication, or more like addition/subtraction? – hpaulj Mar 24 '17 at 03:27
  • @hpaulj while comparing two sparse rows, I'll be having arithmetic operations on their overlapping non-zero values (intersection). For instance, the distance metric to be used is [np.minimum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.minimum.html) over the data of two sparse vectors and then getting sum of the resultant array (= custom distance value). I hope I'm able to clarify the intent. – Tejas Shah Mar 24 '17 at 03:50
  • @ali_m any headers on the approximate nearest neighbors approach? – Tejas Shah Mar 29 '17 at 21:46
  • @hpaulj would you like to have any further inputs for a better understanding of the problem? – Tejas Shah Mar 29 '17 at 21:52
  • I think this might be useful: https://github.com/scikit-learn/scikit-learn/issues/9199 – Julian Jul 04 '17 at 20:12

0 Answers0