  1. Assume I have n documents represented as unit vectors, call it X.
  2. I have the vector representation of one document, call it Xi.
  3. How can I find the closest* vector in X to Xi without a brute-force search (which takes linear time)?

*The distance can be L2; for unit vectors it is a monotone function of cosine similarity (‖a − b‖² = 2 − 2 cos(a, b)), so either works.

My approximate approach (roughly constant time per query, given a fixed candidate budget):

  1. Sort the documents along each vector dimension, keeping one sorting index per dimension.
  2. Use the sorting indexes to brute-force only a subset of the data: e.g. take the closest 1000 documents along each dimension, then compute the L2 distance only for those documents that appear close in all (or most) dimensions (at most 1000 candidates).
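A rough NumPy sketch of this candidate-filtering idea (the random data, the `k_per_dim` budget, and the vote-counting heuristic are my illustrative choices, not part of the question):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit vectors

order = np.argsort(X, axis=0)                      # one sorting index per dimension
X_sorted = np.take_along_axis(X, order, axis=0)    # sorted values, for binary search

def query(xq, k_per_dim=1000):
    votes = np.zeros(len(X), dtype=np.int32)
    for d in range(X.shape[1]):
        pos = np.searchsorted(X_sorted[:, d], xq[d])   # where xq falls in dimension d
        lo = max(0, pos - k_per_dim // 2)
        votes[order[lo:lo + k_per_dim, d]] += 1        # documents close in dimension d
    # brute-force L2 only over the documents that were close in most dimensions
    cand = np.argsort(votes)[-k_per_dim:]
    dists = np.linalg.norm(X[cand] - xq, axis=1)
    return cand[np.argmin(dists)]

print(query(X[42]))   # -> 42 (the document is its own nearest neighbor)
```

Note that each per-dimension lookup is a binary search, so a query costs O(d log n) plus the fixed candidate scan rather than being strictly constant.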

I would like to know, however, whether there is a "cleaner" exact solution in the spirit of the divide-and-conquer algorithm for the Closest Pair of Points problem (which runs in O(n log n)), ideally answering each query in O(log n) time.

PS: Memory should also scale linearly, but that should be fine.

Example: I store 100-dimensional vector representations of 1M documents as 32-bit floats.

  • Vector representations: 1M × 100 dims × 32 bit = 3.2 Gbit = 400 MB
  • Sorting indexes: 1M × 100 indexes × 32 bit = 3.2 Gbit = 400 MB

  • Does this answer your question? [Best data structure for high dimensional nearest neighbor search](https://stackoverflow.com/questions/32152055/best-data-structure-for-high-dimensional-nearest-neighbor-search) – Lior Kogan May 06 '20 at 08:29

2 Answers


As far as I know, there is no algorithm that achieves O(log n) worst-case. However, for more or less randomly distributed points, there are exact space-partitioning methods that run in O(log n) on average. If your set of documents X is immutable, you can use a k-d tree. If you need to support modifications, try an R* tree, which is considerably more complicated but supports insertions into and deletions from X, and has more consistent query times (still averaging O(log n)). Both structures use linear space.
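For illustration, a minimal exact-query sketch using SciPy's `cKDTree` (sklearn's `KDTree` has a similar interface); the random data is made up. As the comment below points out, at 100 dimensions the tree may degrade towards exhaustive search:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)

tree = cKDTree(X)                    # O(n log n) build, linear memory
dist, idx = tree.query(X[42], k=1)   # exact L2 nearest neighbor
print(idx, dist)                     # -> 42 0.0
```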

Roman Svistunov
  • From the Wikipedia article on k-d tree: "As a general rule, if the dimensionality is k, the number of points in the data, n, should be n ≫ 2^k. Otherwise, when k-d trees are used with high-dimensional data, most of the points in the tree will be evaluated and the efficiency is no better than exhaustive search, and, if a good-enough fast answer is required, approximate nearest-neighbour methods should be used instead." – Danylo Mysak Aug 26 '22 at 07:07

Answering my own question:

The best solution I have found so far is Annoy (Approximate Nearest Neighbors Oh Yeah) from Spotify: https://github.com/spotify/annoy
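A minimal usage sketch, assuming unit vectors as in the question (the file name and tree count are arbitrary choices):

```python
import numpy as np
from annoy import AnnoyIndex

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)

f = 100                                   # vector dimensionality
index = AnnoyIndex(f, 'angular')          # angular distance ~ cosine; fine for unit vectors
for i, v in enumerate(X):
    index.add_item(i, v)
index.build(10)                           # 10 trees: more trees -> better recall, more memory
index.save('docs.ann')                    # the index can later be mmap-ed from disk

print(index.get_nns_by_vector(X[42], 1))  # -> [42]
```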

Other than that, scikit-learn also provides functionality for fast nearest neighbor search (exact, via k-d trees and ball trees): https://scikit-learn.org/stable/modules/neighbors.html
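A minimal sketch with scikit-learn's `NearestNeighbors` (again on made-up data; `algorithm='auto'` picks between brute force, ball tree, and k-d tree, all exact):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)

nn = NearestNeighbors(n_neighbors=1, algorithm='auto').fit(X)
dist, idx = nn.kneighbors(X[42:43])   # the query must be a 2-D array
print(idx[0, 0], dist[0, 0])          # -> 42 0.0
```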