- Assume I have n documents represented as unit vectors, call it X.
- I have the vector representation of one document, call it Xi.
- How can I find the closest* vector in X to Xi without brut-force search (linear time).
*Distance can be L2; proportionally equals cosine-similarity when we talk about unit vectors.
My approximate approach (constant time): 1. Sort all the documents for each vector dimension. 2. The use the sorting index to brute-force only through a subset of the data: f.e. include all the closest 1000 documents for each vector dimension, brut-force calculate L2 distance through those documents (1000) which appear close in all (or most) dimensions. (max. 1000)
I would like to know however if there is a "cleaner" exact solution like the divide and conquer algorithm for the Closest Pair of Points problem, which runs in log(n) time.
PS: Memory should scale linearly as well. But this should be fine.
Example: I store the 100 dimension vector representations for 1M documents as 32bit floats.
- Vector representations: 1M*100 dims*32bit = 3.2Gbit = 400MB
- Sorting indexes: 1M*100 sorts*32bit = 3.2Gbit = 400MB