
I have a large corpus of text that I have converted to a sparse term-document matrix (I am using scipy.sparse.csr.csr_matrix to store it). For every document, I want to find its top-n nearest-neighbour matches. I was hoping that the NearestNeighbors class in the Python scikit-learn library (sklearn.neighbors.NearestNeighbors, to be precise) would solve my problem, but the efficient algorithms that use space-partitioning data structures, such as k-d trees or ball trees, do not work with sparse matrices. Only the brute-force algorithm works with sparse input, which is infeasible in my case given the size of the corpus.
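For concreteness, a minimal sketch of the limitation (the matrix shape, density, and neighbour count here are made up for illustration):

    import scipy.sparse as sp
    from sklearn.neighbors import NearestNeighbors

    # Toy stand-in for a sparse term-document matrix
    X = sp.random(1000, 20000, density=0.001, format='csr')

    # Brute force accepts sparse input, but is quadratic in the number of documents:
    nn = NearestNeighbors(n_neighbors=5, algorithm='brute', metric='cosine')
    nn.fit(X)
    distances, indices = nn.kneighbors(X)

    # Tree-based algorithms cannot operate on sparse input; scikit-learn
    # falls back to brute force (with a warning) in that case:
    # NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)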

Is there any efficient implementation of nearest neighbour search for sparse matrices (in Python or in any other language)?

Thanks.

abhinavkulkarni
  • 2,284
  • 4
  • 36
  • 54

2 Answers


Late answer: have a look at locality-sensitive hashing (LSH), an approximate nearest-neighbour technique that works directly on sparse, high-dimensional data.

Support in scikit-learn has been proposed here and here.
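A minimal usage sketch, assuming a scikit-learn version that still ships LSHForest (it was added in 0.16, deprecated in 0.19, and removed in 0.21); the matrix shape and parameters are illustrative:

    import scipy.sparse as sp
    from sklearn.neighbors import LSHForest  # removed in scikit-learn 0.21

    # Toy stand-in for a sparse term-document matrix
    X = sp.random(1000, 20000, density=0.001, format='csr')

    # Approximate neighbours under cosine distance; sparse input is accepted:
    lshf = LSHForest(n_estimators=10, n_candidates=50, random_state=42)
    lshf.fit(X)
    distances, indices = lshf.kneighbors(X, n_neighbors=5)

Because LSHForest is approximate, the returned neighbours may differ slightly from what exact brute-force search would give.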

Unapiedra
  • I can confirm that LSHForest works and that sparse matrix input is supported as of 2016, which is great. The downside is the surprising slowness of this implementation; it may need a version 2, although the output of version 1 at least appears correct. For better speed I am trying BallTree, using SVD first to reduce dimensionality (as Mathieu suggested) to satisfy BallTree's dense-matrix requirement. SVD might also help me find latent features to aid clustering with DBSCAN. Finally, cosine distance is LSHForest's only distance metric, which is good for many data sets but not all. – Geoffrey Anderson Dec 07 '16 at 14:39

You can try transforming your high-dimensional sparse data into low-dimensional dense data using TruncatedSVD, and then build a ball tree on the result.
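A minimal sketch of this approach (the 100 components and 5 neighbours are illustrative, not prescriptive):

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import normalize

    # Toy stand-in for a sparse term-document matrix
    X = sp.random(1000, 20000, density=0.001, format='csr')

    # Reduce to a dense, low-dimensional representation (LSA when X is tf-idf):
    svd = TruncatedSVD(n_components=100, random_state=42)
    X_reduced = normalize(svd.fit_transform(X))  # L2-normalize rows

    # Ball trees work on the dense output:
    nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
    nn.fit(X_reduced)
    distances, indices = nn.kneighbors(X_reduced)

Since a ball tree defaults to Euclidean distance, L2-normalizing the SVD output makes Euclidean nearest neighbours coincide with cosine nearest neighbours in the reduced space.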

Mathieu
  • Are you sure a ball tree will behave nicely with SVD output? Typically for text data, you'd want SVD to keep some 100-200 dimensions... – Fred Foo Aug 14 '13 at 18:01