5

I am in the process of learning about bloom filters and I am looking through various hash functions in JavaScript.

For example I found this one in another Stack Overflow answer:

Found here https://stackoverflow.com/a/7616484/5217568)

String.prototype.hashCode = function() {
  var hash = 0, i, chr, len;
  if (this.length == 0) return hash;
  for (i = 0, len = this.length; i < len; i++) {
    chr   = this.charCodeAt(i);
    hash  = ((hash << 5) - hash) + chr;
    hash |= 0; // Convert to 32bit integer
  }
  return hash;
};

If I run:

String.prototype.call(null, "hello") 

I get the numerical value of : 99162322 (two other hash functions got me: 1335831723, and 120092131).

Now If I create a hypothetical bloom filter with 3 hash functions, and 18 indices (k=3, m=18), How are these large values indexed in an array with indices from 0-17?

Community
  • 1
  • 1
jmancherje
  • 6,439
  • 8
  • 36
  • 56

1 Answers1

2

Use the remainder/modulo operator % to wrap a randomly-generated value within a certain bound.

If you have 18 elements (indices 0 to 17), you could get an index with 99162322 % 18 (16).

If the number of hash value is not a multiple of the number of indices, the result will be biased. For example, if your hash value is one of the five values from 0 to 4, but you were mapping it onto the three indices from 0 to 2, it would be biased towards 0 (0 % 3, 3 % 3) and 1 (1 % 3 or 4 % 3) over 2 (only 2 % 3). Depending on your needs, the bias may be acceptable if the number of hash values is sufficiently larger than the number of indices. If you want to to avoid it you'll need a scheme to generate a new hash input if the hash result is from the bias-inducing range. Something like this:

function hashIndex(string, length, hashValueCount) {
  var minBiasedIndex = hashValueCount - (hashValueCount % length);
  for (var i = 0; ; i++) {
    var hashInput = string + ":" + String(i);
    var hashResult = hash(hashInput);
    if (hashResult < minBiasedIndex) {
      return hashResult % length;
    }
  }
}
Jeremy
  • 1
  • 85
  • 340
  • 366
  • This is very interesting, thank you. Can you help clarify the parameters in your function. For my example in the question, would it be hashIndex("hello", 18, 17)? Where 18 is the length of the array, and 17 is the highest index in the array? – jmancherje Nov 28 '15 at 05:52
  • 1
    I think the number of possible hash values (hashValueCount) would be something like 2 to the power of 31 for you. I think it your hash function can return any unsigned 31-bit integer, though I may misunderstand. I think that might be too large for my minBiasedIndex calculation to work because of the limits of % -- oops. (Also, I initially had some off-by-one errors in my function. I think it's fixed now.) – Jeremy Nov 28 '15 at 05:57
  • (and yeah, length would be 18 and string would be "hello".) Given that your number of possible hash values is so much larger than your number of indices, I *think* the bias is very small -- less than 0.000001%. I might consider just use using modulo operator directly and accepting this bias, depending on the application. – Jeremy Nov 28 '15 at 06:07