Range search in d-space with discrete coordinates

Question

I would like to develop a range searching algorithm that reports all points within a given distance of a query point.

The points are specified by d integer coordinates in a tiny range, say up to 6 bits per dimension (range 0..63), for a total bit count not exceeding 60 bits.

The distance metric is Manhattan or Euclidean (up to you), i.e. the sum of absolute or squared coordinate differences. In the special case of a single bit per dimension, it amounts to the Hamming distance.

There can be up to a million points.

Are you aware of a practical data structure that supports fast queries, say O(Log²(n)+k) or similar (with space O(n)) in such conditions ? A reasonable preprocessing time (subquadratic) is also required.

k-D trees are a first option, but they don't exploit the finiteness of the coordinates and are likely to perform poorly in high dimensions, I am afraid.

The case of a single bit per coordinate is especially interesting. Even partial solutions are welcome.

I think the best shot is a VP Tree, I removed my previous answer and added an answer that demonstrates correct output with a javascript implementation. If further optimization is requred later, the VP Tree algorithm can be modiied to be in-place with respect to the input array, using range designations in the tree nodes and using a quicksort-like partitioning of the array. — Louis Ricci, Sep 25 '15 at 16:59
Interesting. In the meantime, I heard of the BK-trees. Will be good if I can try and compare the two in my use case. That will take long time. — , Sep 25 '15 at 19:10

Louis Ricci · Answer 1 · 2015-09-25T16:51:47.357

After some thought (and prodding by @YvesDaoust) using a VP Tree (Vantage Point Tree https://en.wikipedia.org/wiki/Vantage-point_tree) is probably the best solution.

VP Tree is a BSP where the left nodes are inside the distance and the right nodes are outside of the distance. This works for single bit per dimension and multiple bit per dimension (only the distance formula would change. The distance is a per tree node threshold/radius. Querying involves recursing through the tree getting current node value's distance to the query value and comparing that result with the query distance.

JSFiddle http://jsfiddle.net/fgq1rfLk/

var DIMS = 16;
var BITS = 1;
var MASK = (Math.pow(2, BITS) - 1)|0;
var SIZE = DIMS * BITS;
var list = [];
var tree = null;
//
// set bit count (population count)
function popCnt(x) {
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}
//
// manhattan distance
function dist(a, b) {
    if(BITS == 1) {
        return popCnt(a ^ b);
    }
    var result = 0;
    for(var i=0; i<DIMS; i++) {
        var shr = i * BITS;
        result += Math.abs(((a >> shr) & MASK) - ((b >> shr) & MASK));
    }
    return result;
}
//
// Vantage point tree
// max size of tree leaf nodes
VP_LEAF_SIZE = 32;
// need to choose a reasonable maximum distance
VP_DISTANCE = (BITS === 1) ? SIZE : 32;
function VPTree(data) {
    this.radius = null;
    this.center = null;
    this.values = null;
    this.inside = null;
    this.outside = null;
    // 
    var n = data.length;
    var r = data[0];
    // leaf node?
    if(n <= VP_LEAF_SIZE || n <= 1) {
        this.values = [].concat(data);
        return this;
    }
    this.center = r;
    // process data for counts at all possible distances
    var buckets = Array(VP_DISTANCE + 1);
    for(var i=0; i<=VP_DISTANCE; i++) { 
        buckets[i] = 0; 
    }
    // distance counts
    for(var i=0; i<n; i++) {
        var v = data[i];
        var d = dist(r, v);
        if(d < VP_DISTANCE) {
            buckets[d]++;
        } else {
            buckets[VP_DISTANCE]++;
        }
    }
    // distance offsets
    var sum = 0;
    for(var i=0; i<=VP_DISTANCE; i++) { 
        buckets[i] = (sum += buckets[i]); 
    }
    // pivot index
    var median = n >> 1;
    var pivot = 1;
    for(var i=1; i<=VP_DISTANCE; i++) {
        if(buckets[i] > median) {
            pivot = (i > 1 && median - buckets[i - 1] <= buckets[i] - median) ? i - 1 : i;
            break; 
        }
    }
    this.radius = pivot;
    // parition data into inside and outside
    var iCount = buckets[pivot] - buckets[0];
    var oCount = (n - buckets[pivot]) - buckets[0];
    var iData = Array(iCount);
    var oData = Array(oCount);
    iCount = oCount = 0;
    for(var i=0; i<n; i++) {
        var v = data[i];
        if(v === r) { continue; };
        if(dist(r, v) <= pivot) {
            iData[iCount++] = v;
        } else {
            oData[oCount++] = v;
        }
    }
    // recursively create the rest of the tree
    if(iCount > 0) {
        this.inside = new VPTree(iData);
    }
    if(oCount > 0) {
        this.outside = new VPTree(oData);
    }
    return this;
}
VPTree.prototype.query = function(value, distance, result) {
    if(result === undefined) {
        return this.query(value, distance, []);
    }
    // leaf node, test all values
    if(this.values !== null) {
        for(var i=0; i<this.values.length; i++) {
            var v = this.values[i];
            if(dist(value, v) <= distance) {
                result.push(v);
            }
        }
        return result;
    }
    // recursively test the rest of the tree
    var tmpDistance = dist(value, this.center);
    // inside
    if(tmpDistance <= distance + this.radius) {
        if(tmpDistance <= distance) {
            result.push(this.center);
        }
        if(this.inside !== null) {
            this.inside.query(value, distance, result);
        }
    }
    // outside
    if(tmpDistance + distance > this.radius && this.outside !== null) {
        this.outside.query(value, distance, result);
    }
    return result;
}

EDIT Here's the JSFiddle showing a 2d (x, y) (8bits, 8bits) http://jsfiddle.net/fgq1rfLk/1/

score 0 · Answer 2 · answered Oct 29 '15 at 17:26

If the points have explicit coordinates and if d is not too large, which seems to be the case here, I think (but I may be wrong, needs testing) that a Kd-tree will be more efficient than a VP-tree, since it can benefit from more structure from the data (coordinates), whereas VP-tree only "sees" point-to-point distances.

There is an efficient implementation of Kd-trees with all the needed range search functions (L2 and Manathan metric) in ANN [1] (however, it stores all coordinates explicitly and you probably want to benefit from your "compressed coordinates" representation.

An alternative is my own implementation of KdTree in Geogram [2], it is quite simple (though highly inspired by ANN) and can probably quite easily be adapted to use your compressed coordinates representation (but it only has k nearest neighbors search with L2 metric)

Referencecs:

[1] https://www.cs.umd.edu/~mount/ANN/

[2] http://alice.loria.fr/software/geogram/doc/html/classGEO_1_1KdTree.html

Range search in d-space with discrete coordinates

2 Answers2