
Given: A huge set of N-dimensional vectors { V1, V2, V3, ..., VM }. Example of a vector:

[72, 100, 34, 45, 87, 123, 99, 32] // N = 8

Input: As input, we are given another vector with the same dimension as the vectors in the set described above. Let's call this vector X.

Objective: Find the most similar vector (or the top K similar vectors, where K is relatively small) from the given set for the provided vector X. Similarity is defined by Euclidean distance: https://en.wikipedia.org/wiki/Euclidean_distance.

I am looking for an approach that can give me O(log M) complexity, where M is the number of vectors in the set.

Note that N can be relatively big (like 100, 500, 1000) and M is huge (like multiple millions or billions).
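For reference, here is a minimal brute-force sketch (NumPy; the function name and variables are just illustrative) of the naive O(M*N) approach I need to beat:

```python
import numpy as np

def top_k_euclidean(vectors: np.ndarray, x: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k vectors closest to x by Euclidean distance.

    vectors: (M, N) array, x: (N,) array. Brute force: O(M*N).
    """
    # Squared distances rank the same as distances, so skip the square roots.
    d2 = np.sum((vectors - x) ** 2, axis=1)
    # argpartition picks the k smallest in O(M); sort only those k afterwards.
    idx = np.argpartition(d2, k)[:k]
    return idx[np.argsort(d2[idx])]

# vectors = np.random.rand(1_000_000, 8)   # M = 1e6, N = 8 as in the example
# x = np.random.rand(8)
# print(top_k_euclidean(vectors, x, k=5))
```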

I am looking into https://en.wikipedia.org/wiki/Locality-sensitive_hashing.

Stephen L.
  • My understanding is that performance of a given method will depend on the characteristics of the `Vk` vectors. Are they random-like or do you have more information on them? – Damien Nov 06 '19 at 08:54
  • They are random-like. – Stephen L. Nov 06 '19 at 09:34
  • The problem is that we have too many axes (500, 1000) and all of them are equally important. Sorting vectors by one axis will just reduce the dimension of the problem by one. – Stephen L. Nov 06 '19 at 13:42
  • 1
    @StephenL. and if you sort all of them you got Octree and `O(N+log(M))` result see my answer. Without knowing more about what you are doing I can only guess but I got the feeling that: euclidean distance is not the best metric ... maybe better would be correlation coefficient , or dot product ... – Spektre Nov 07 '19 at 09:25
  • @Spektre you're right, Euclidean distance might not be the perfect metric for my problem. Currently, I am considering using https://en.wikipedia.org/wiki/Cosine_similarity – Stephen L. Nov 08 '19 at 09:33
  • 1
    @StephenL. that is the dot product I mentioned its like angle between the vectors, did not know it have a name :) the question is if magnitude should be significant or not ... if yes create an uniform grid (of size of your threshold) and compute the vectors to each point as a bin ... if not you could reduce to something like this [How to distribute points evenly on the surface of hyperspheres in higher dimensions?](https://stackoverflow.com/a/57240140/2521214) but in such high dimensionality it would be insane ... maybe approximate with hyper box instead and normalize the resulting vectors... – Spektre Nov 08 '19 at 10:19
  • @Spektre, I was thinking about a hyper grid as well. Do you know any solutions that work out of the box? For instance, for (lat, long), basically N = 2, there is this solution - https://redis.io/commands/geohash – Stephen L. Nov 09 '19 at 06:58
  • @StephenL. apart from my approach in the link above there is also the Fibonacci lattice, but that is **very hard math** in **ND** and you will not find any pre-computed solutions/equations anywhere ... so a hyper grid is your easiest option (or just the grids on the faces of the hyper cube enveloping your hyper space, if the vectors are directions); yes, that would have some angular distortions, but it is foolproof to implement, as using different metrics in ND is hard due to multi-connectivity of the surfaces that we cannot comprehend naturally. – Spektre Nov 09 '19 at 08:37
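Since the comments consider switching from Euclidean distance to cosine similarity (i.e. the normalized dot product), here is a minimal illustrative sketch of that metric; the names are made up, and it is still brute force:

```python
import numpy as np

def top_k_cosine(vectors: np.ndarray, x: np.ndarray, k: int) -> np.ndarray:
    """Rank by cosine similarity instead of Euclidean distance (brute force)."""
    # Normalize the set once; for unit vectors cosine similarity is just the
    # dot product, and maximizing it equals minimizing Euclidean distance.
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = x / np.linalg.norm(x)
    sims = v @ q                              # (M,) similarities in [-1, 1]
    idx = np.argpartition(-sims, k)[:k]       # k largest similarities
    return idx[np.argsort(-sims[idx])]
```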

1 Answer


The naive approach is O(N.M), so here are a few options:

  1. Ordering by one axis O(N.log(M))

    1. (Index) sort the list by one axis

      which is O(N.M.log(M)) but that is done just once.

    2. Binary search for the first vector whose ordered axis has value >= X[axis] - threshold

      this is O(N.log(M))

    3. Linearly scan vectors while the ordered axis has value <= X[axis] + threshold

      This should be near O(N.K). Test each vector in this window for similarity to X; if it is similar, add it to the solution list. (A sketch of this approach is given after the list.)

  2. Ordering by Locality-sensitive hashing O(N+log(M))

    Yes, this would lead to O(N+log(M)), however with false positives and false negatives, so unless you can afford to miss solutions this is a no-go, as you would need to test all vectors just to be sure. (A minimal illustrative LSH sketch is given after the list.)

  3. Ordering by feature O(N+log(M))

    This is similar to #2, but instead of a hash you use a feature of the data that is relevant to the similarity. It can be any feature valid for the comparison. Thanks to that, there are no false positives nor false negatives.

    You did not specify the meaning of the data in the vectors nor any ranges, so I can only guess here. But you defined similarity as Euclidean distance, so the best feature is position.

    So you can create an octree to spatially reorder your data. From that you just take your input vector, find the bucket it falls into, and search all nearby buckets up to some threshold distance ...

    If you set the bucket size to your threshold distance, then you only need to search the immediately neighboring buckets (3^N in total, e.g. 8+1 in 2D).

    Getting the bucket index from a vector should be O(N), converting this into O(N+log(M)). (A sketch of this grid-bucketing variant is given below.)
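A minimal sketch of option 1 (sort by one axis, binary-search the window, test candidates exactly), using Python's standard `bisect` module; the choice of axis 0 and the names are just assumptions for illustration:

```python
import bisect
import numpy as np

def build_axis_index(vectors: np.ndarray, axis: int = 0):
    """One-time sort of the vector indices by a single chosen axis."""
    order = np.argsort(vectors[:, axis])
    keys = vectors[order, axis]                # sorted axis values, for bisect
    return order, keys

def axis_query(vectors, order, keys, x, threshold, axis: int = 0):
    """Return indices of all vectors within `threshold` of x (Euclidean).

    Any vector within `threshold` of x must have its chosen axis inside
    [x[axis] - threshold, x[axis] + threshold], so binary-search that window
    and test only the vectors inside it.
    """
    lo = bisect.bisect_left(keys, x[axis] - threshold)    # O(log M)
    hi = bisect.bisect_right(keys, x[axis] + threshold)
    hits = []
    for i in order[lo:hi]:                                # linear scan of window
        if np.linalg.norm(vectors[i] - x) <= threshold:   # exact O(N) test
            hits.append(int(i))
    return hits
```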
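For option 2, one commonly used LSH family is random-hyperplane hashing; the sketch below (Python, single hash table, names are my own) only illustrates where the false positives and false negatives come from, it is not a tuned index:

```python
import numpy as np
from collections import defaultdict

class RandomHyperplaneLSH:
    """Single-table LSH: the sign pattern of `bits` random projections forms the bucket key."""

    def __init__(self, dim: int, bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.planes = rng.normal(size=(bits, dim))     # `bits` random hyperplanes
        self.buckets = defaultdict(list)

    def _key(self, v: np.ndarray) -> int:
        # Which side of each hyperplane v falls on -> a `bits`-bit signature.
        signs = (self.planes @ v) > 0
        return sum(1 << i for i, s in enumerate(signs) if s)

    def add(self, idx: int, v: np.ndarray) -> None:
        self.buckets[self._key(v)].append(idx)

    def candidates(self, x: np.ndarray) -> list:
        # Only x's own bucket is examined: near vectors hashed elsewhere are
        # missed (false negatives) and bucket mates can still be far away
        # (false positives), so every candidate must be re-checked exactly.
        return self.buckets.get(self._key(x), [])
```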
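For option 3, here is a minimal sketch of the uniform-grid variant mentioned in the comments, with cell size equal to the threshold distance; note that visiting all 3^N neighboring cells is only practical for small N:

```python
import numpy as np
from itertools import product
from collections import defaultdict

def build_grid(vectors: np.ndarray, cell: float):
    """One O(N*M) pass: bucket every vector by its integer cell coordinates."""
    grid = defaultdict(list)
    for i, v in enumerate(vectors):
        grid[tuple(np.floor(v / cell).astype(int))].append(i)
    return grid

def grid_query(vectors, grid, x, cell):
    """Find all vectors within `cell` (Euclidean) of x.

    With cell size equal to the threshold, any hit lies in x's cell or an
    immediately neighboring one, so only those 3**N cells are visited.
    """
    base = np.floor(x / cell).astype(int)
    hits = []
    for offset in product((-1, 0, 1), repeat=len(x)):     # 3**N neighbor cells
        for i in grid.get(tuple(base + np.array(offset)), []):
            if np.linalg.norm(vectors[i] - x) <= cell:    # exact O(N) test
                hits.append(i)
    return hits
```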

Spektre