11

I am looking to implement an O(1) lookup for binary quadkeys

I know that there is no true hashtable for javascript, instead using objects and o(1) lookup with their properties, but the issue there is that the keys are always converted to strings.

I suspect to have over a 10 million entries in memory, and if I have to rely on keys being strings and with the average quadkey string being 11.5 characters that would equate to (10 million entries * 11.5 length * 2 bytes) = 230,000,000 bytes or 230 MB.

Compared to storing as int64 (10 million entries * 8 bytes) = 80,000,000 bytes or 80 MB

I know that int64 isn't supported natively with javascript, but there are libraries that can assist with that in terms of doing the bitwise operations i'd want.

Now, even though there exists libraries that can deal with int64, they are ultimately objects that are not truly representing 8 bytes, so I believe I cannot use any int64 library in a hashtable, instead I was considering using 2-deep hash table with two int32. The first key would be the first 4 bytes, and the 2nd key being the last 4 bytes. It's not as ideal as 1 operation lookup to find a value, but 2 operations which is still good enough.

However, I feel this is not worth it if all keys are stored as strings, or the fact that all numbers are double-precision floating-point format numbers (8 bytes) so it each hash entry would then take up 16 bytes (two "int32" numbers).

My questions are:

1. If you store a number as the key to a property, will it take up 8 bytes of memory or will it convert to string and take up length*2bytes?

  1. Is there a way to encode binary quadkeys into the double-precision floating-point format number standard that javascript has adopted, even if the number makes no sense , the underlying bits serve a purpose (unique identification of a construct).

PS: I am marking this with nodejs as there may exist libraries that could assist in my end goal


Edit 1:

Seems 1 is possible with Map() and node 0.12.x+

As far as number 2 I was able to use a int64 lib (bytebuffer) and convert a 64int to a buffer.

I wanted to just use the buffer as the key to Map() but it would not let me as the buffer was internally an object, which each instance acts as a new key to Map().

So I considered turning the buffer back into native type, a 64bit double.

Using readDoubleBE I read the buffer as a double, which represents my 64int binary and successfully lets me use it in a map and allows for O(1) lookup.

var ByteBuffer = require("bytebuffer");

var lookup = new Map();


var startStr = "123123738232777";
for(var i = 1; i <= 100000; i++) {
    var longStr = startStr + i.toString();
    var longVal = new ByteBuffer.Long.fromValue(longStr);

    var buffer = new ByteBuffer().writeUint64(longVal).flip().toBuffer();
    var doubleRepresentation = buffer.readDoubleBE();
    lookup.set(doubleRepresentation, longStr);
}

console.log(exists("12312373823277744234")); // true
console.log(exists("123123738232777442341232332322")); // false


function exists(longStr) {
    var longVal = new ByteBuffer.Long.fromValue(longStr);
    var doubleRepresentation = new ByteBuffer().writeUint64(longVal).flip().toBuffer().readDoubleBE();
    return lookup.has(doubleRepresentation);
}

The code is sloppy and there are probably shortcuts I am missing, so any suggestions/hints are welcomed.

Ideally I wanted to take advantage of bytebuffer's varints so that I can have even more savings with memory but I wasn't sure if that is possible in a Map, because I wasn't able to use a buffer as a key.


Edit 2:

Using memwatch-next I was able to see that max heap size was 497962856 bytes with this method with Set() whereas using a string in Set() was 1111082864 bytes. That is about 500MB vs 1GB, which is a far cry from 80mb and 230mb, not sure where the extra memory use is coming from. It's worth nothing for these memory tests I used Set over Map that way it should only store unique keys in the data structure. (Using Set as just a existence checker, where Map would serve as a lookup)

ParoX
  • 5,685
  • 23
  • 81
  • 152
  • 5
    If you use nodejs 4+ then you might want to look at _Map_ [MDN: Map - Objects and maps compared](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map#Objects_and_maps_compared): `[...]The keys of an Object are Strings and Symbols, where they can be any value for a Map.[...]` or maybe also _WeakMap_ . – t.niese Nov 05 '15 at 15:48
  • I was hoping to make a node package that was 0.12.x compatible as my own application is 0.12 right now, but good to know something like this exists for 4+. Worse case ill just update my application to support 4 – ParoX Nov 05 '15 at 16:00
  • If it is an option you might use `0.12.x` with `--harmony` but to be honest I don't know if `Map` is stable in harmony (but i would guess so). – t.niese Nov 05 '15 at 16:33
  • Looks like 0.12 can use map without `--harmony` http://stackoverflow.com/questions/28388885/ecmascript-6-features-available-in-node-js-0-12 – ParoX Nov 05 '15 at 18:29
  • possibly related: http://stackoverflow.com/q/24813487/1048572 – Bergi Nov 05 '15 at 21:50

2 Answers2

1

Because your keys are integers (and are unique) you could just use them as array indices. However, JS arrays are limited to contain max entries bounded by 32 or 64 bit integer depending on your platform.

To overcome this, you could use your two-step approach, but without using objects and their string hashes. You can store it something like this store[0010][1000] = 'VALUE' - the item with binary key 00101000 is stored under index 0010 in first array and index 1000 in child array

In decimal, you're dealing with store[2][8] = 'VALUE', which is equivalent to doing store[40] = 'VALUE' in 64bit space

You get yourself a tree with all the properties you want:

  • You can easily look up by key (in two steps)
  • Your keys are integers
  • You're dealing with 32bit integers (or less, depending on how you chunk it)
Andrei R
  • 4,904
  • 5
  • 25
  • 26
  • That should work if Array implementation in that particular VM is based on sparse arrays or hash tables. – c-smile Nov 19 '15 at 01:20
0

The doubling in memory from Map to Set in your version of node comes from a bad implementation. Well, not "bad" per se just not fit for millions of entries. The simpler handling of Set gets bought with memory. There's no free lunch, as always, sorry.

Why do they use so much in general? They are supposed to handle any object and the method to be able to handle all of the possible varieties is really expensive. It is possible to optimize if all you have is of one kind but you have to check for it and in 99,99% of all cases it's not worth the hassle because the maps and sets and arrays are short, a couple of thousand entries at most. To be bland: developer time is expensive and better spend elsewhere. I could add: it's open source, do it yourself! but I know it's way easier said than done ;-)

You need to roll it yourself. You can use a Uint32Array for that and build a hash-table around it.

The Bing-Maps are encoded with strings of base-4 digits (at most 23) according to MS and the Quad-key description. Using the encoding of the latter (haven't read the former, so it might be wrong in the details) we can put it into two 32-bit integers:

function quadToInts(quad, zoom){
  var high,low, qlen, i, c;
  high = 0>>>0;
  low = 0>>>0
  zoom = zoom>>>0;

  // checks & balances omitted!

  qlen = quad.length;

  for(i = 0; i < 16 && i < qlen ;i++){
    c = parseInt(quad.charAt(i));
    high |= c << (32-(i*2 + 2));
  }
  // max = 23 characters (says MS)
  for(i = 0; i < 8 && i < qlen - 16 ;i++){
    c = parseInt(quad.charAt(16 + i));
    low |= c << (32-(i*2 + 2));
  }

  low |= zoom;

  return [high,low];
}

And back

// obligatory https://graphics.stanford.edu/~seander/bithacks.html
function rev(v){
  var s = 32;
  var mask = (~0)>>>0;         
  while ((s >>>= 1) > 0) {
    mask ^= (mask << s)>>>0;
    v = ((v >>> s) & mask) | ((v << s) & (~mask)>>>0);
  }
  return v;
}

function intsToQuad(k1,k2){
  var r, high, low, zoom, c, mask;

  r = "";
  mask = 0x3; // 0b11

  high = k1>>>0;
  low = k2>>>0;
  zoom = low & (0x20 - 1);
  low ^= zoom;

  high = rev(high);
  for(var i = 0;i<16;i++){
    c =  high & mask;
    c = (c<<1 | c>>>1) & mask;
    r += c.toString(10);
    high >>>= 2;
  }
  low = rev(low);
  for(var i = 0;i<16;i++){
    c =  low & mask;
    c = (c<<1 | c>>>1) & mask;
    r += c.toString(10);
    low >>>= 2;
  }
  return [r,zoom];
}

(All quick hacks, please check before use! And the C&P devil might have had its hand in here, too)

A rough sketch for a hash-table base on the following hash-function

// shamelessly stolen from http://www.burtleburtle.net/bob/c/lookup3.c
function hashword(k1, // high word of 64 bit value
                  k2, // low word of 64 bit value
                  seed // the seed
                  ){

  var a,b,c;
  var rot = function(x,k){
     return (((x)<<(k)) | ((x)>>>(32-(k))));
  };

  /* Set up the internal state */
  a = b = c = 0xdeadbeef + ((2<<2)>>>0) + seed>>>0;

  if(arguments.length === 2){
    seed = k1^k2;
  }

  b+=k2;
  a+=k1;

  c ^= b; c -= rot(b,14)>>>0;
  a ^= c; a -= rot(c,11)>>>0;
  b ^= a; b -= rot(a,25)>>>0;
  c ^= b; c -= rot(b,16)>>>0;
  a ^= c; a -= rot(c,4)>>>0; 
  b ^= a; b -= rot(a,14)>>>0;
  c ^= b; c -= rot(b,24)>>>0;

  return c;
}
function hashsize(N){
    var highbit = function(n) {
        var r = 0 >>> 0;
        var m = n >>> 0;
        while (m >>>= 1) {
            r++;
        }
        return r;
    };

  return (1<<(highbit(N)+1))>>>0;
}
function hashmask(N){
  return (hashsize(N)-1)>>>0;
}

And the (rather incomplete) code for table handling

/*
 Room for 8-byte (64-bit) entries
  Table pos.  Array pos.
   0             0         high, low
   1             2         high, low
   2             4         high, lowl
           ...
   n             n*2       high, low

*/
function HashTable(N){
  var buf;
  if(!N)
    return null;

  N = (N+1) * 2;

  buf = new ArrayBuffer(hashsize(N) * 8);
  this.table = new Uint32Array(buf);
  this.mask = hashmask(N);
  this.length = this.table.length;
}

HashTable.prototype.set = function(s,z){
  var hash, quad, entry, check, i;

  entry = null;
  quad = quadToInts(s,z);

  hash = hashword(quad[0],quad[1]);

  entry = hash & this.mask;

  check = entry * 2;
  if(this.table[check] != 0 || this.table[check + 1] != 0){
    //handle collisions here
    console.log("collision in SET found")
    return null;
  } else {
    this.table[check] = quad[0];
    this.table[check + 1] = quad[1];
  }
  return entry;
};

HashTable.prototype.exists = function(s,z){
  var hash, quad, entry, check, i;

  entry = null;
  quad = quadToInts(s,z);

  hash = hashword(quad[0],quad[1]);
  entry = hash & this.mask;

  check = entry * 2;
  if(this.table[check] != 0 || this.table[check + 1] != 0){

    return entry;
  }
  return -1;
};

HashTable.prototype.get = function(index){
  var entry = [0,0];

  if(index > this.length)
    return null;

  entry[0] = this.table[index * 2];
  entry[1] = this.table[index * 2 + 1];

  return entry;
};

// short test
var ht = new HashTable(100);
ht.set("01231031020112310",17);
ht.set("11231031020112311",12);
ht.set("21231033020112310",1);
ht.set("31231031220112311321312",23);

var s = "";
for(var i=0;i<ht.table.length;i+=2){
  if(ht.table[i] !== 0){
     var e = intsToQuad(ht.table[i],ht.table[i+1]);
     s += e[0] +", " +e[1] + "\n";
  }
}
console.log(s)

Collisions should be rare, so a couple of short standard arrays would do to catch them. To handle it you need to add another byte to the 8 bytes for the two integers representing the Quad or, better, set the second integer to all ones (won't happen with a Quad) and the first to the position(s) in the collision array(s).

To add payload is a bit more complicated because you have only a fixed length to do so.

I have set the size of the table to the next higher power of two. That can be too much or even waaay too much and you might be tempted to adapt it, so be aware that the masking doesn't work as expected anymore you need to do a modulo instead.

deamentiaemundi
  • 5,502
  • 2
  • 12
  • 20