0

I have around 500,000 32 dimensional vectors (normalized with mean=0, std=1) that I currently store in a KD tree to efficiently find the nearest neighbor. However, now I also want to be able to exclude some of the vectors from the database dynamically for some queries (the condition changes often, so re-building the tree is not an option). I want to fix some of the 32 dimensions to a certain range depending on some conditions that change during runtime.

What I currently do, is instead of looking for k=1 nearest neighbors, I look for k=50 (or more) nearest neighbors and then iterate from the closest to the farthest until I find one that matches the condition. Unfortunately that is not a very elegant solution as it requires the query to return k=50 matches even if k=1 would already have returned the one I am looking for. Also, if k=50 was too small, I need to do another query with k=500 or so and that hurts performance.

So two solutions come to my mind:

  1. Find a KD tree implementation that returns an iterator instead of a fixed-size result with k entries. The iterator would start at the nearest neighbor and then move towards neighbors with greater distance. Due to the design of a KDTree this should be very efficient. Then the tree only needs to be searched until a valid result is found and no fixed k needs to be specified. I was not able to find a Python implementation for that so far.

  2. Use a different data structure or database (MySQL for example) that is designed to do queries based on conditions. Is there any database system (I am also open for NoSQL) that supports efficient nearest neighbor search using dynamic conditions? Maybe a database that allows to use a KD tree as the index?

If nothing is yet available, I'll probably give myself a try to implement a KD tree that does what I want on my own.

EDIT: The language I am currently using is Python for proto-typing, later I will move to C# (Unity).

Simon Hessner
  • 1,757
  • 1
  • 22
  • 49
  • Do the vectors contain floats or doubles or integers ? – amirouche Dec 07 '20 at 18:24
  • Look into https://github.com/facebookresearch/faiss/wiki/Getting-started#building-an-index-and-adding-the-vectors-to-it – amirouche Dec 07 '20 at 18:29
  • 1
    The vectors contain floats.Thanks for the link to FAISS, unfortunately I can't find a way in that tool to add conditions to the NN search. Did I miss it in the docs? – Simon Hessner Dec 07 '20 at 19:27
  • Look into https://stackoverflow.com/a/64396629/140837 that is for FTS, but I think it can apply in your case. That is how I implement search with multiple criteria. Filter + Scoring in parallel is crucial otherwise it does not work (that is a map-reduce). – amirouche Dec 11 '20 at 18:03
  • 1
    Another approach I have taken for NN for spell checking at https://stackoverflow.com/a/58791875/140837 but that only works with a small alphabet. I am not sure how it would be translated to matching floats. – amirouche Dec 11 '20 at 18:05
  • Do you want me to provide an example program in python that does query the NN when n = 3 and v[0] is between 0.1 and 0.2 ? – amirouche Dec 18 '20 at 07:30
  • Yes, I think that would be very helpful! – Simon Hessner Dec 22 '20 at 21:12

1 Answers1

1

The idea is similar to how to speed up boolean keyword search with positive terms:

  • candidates selection: reduce the size of the search space as much as possible

  • scoring: compare each candidates with the query vector and only keep the best candidate vector until all candidates have been scored. That step can be done in parallel. Basically, a brute force algorithm over a reduced space.

Unlike the full text search question from above you have vectors of floats or doubles with constraints on one or more dimensions. That is a geometric problem and is most of the time encountered in Geographical Information System (GIS) except in your case instead of two, or three or even four dimensions they are 32 dimensions.

One way to make the candidates selection is to index all the vectors using a space filling curve. The constraints describe a region inside the 32 dimensional space, and you want to know what are the vectors that are in that subspace, because the nearest neighbor you are looking for is necessarily in that subspace, and it can not be outside. You can not further reduce the search space without more constraints.

Space filing curves like morton code or xz-ordering can easily be implemented inside ordered key value store.

The best explanation of how the algorithm works is:

Z-Order Indexing for Multifaceted Queries in Amazon DynamoDB: Part 1

There is an implementation of xz-ordering in scala as part of geomesa.

There is various okvs implementation, to experiment I recommend lsm-db.

amirouche
  • 7,682
  • 6
  • 40
  • 94