4

Just wondering if there is some other way than this.

var hashStringArray = function(array) {
    array.sort();
    return array.join('|');
};

I don't like sorting much, and using that delimiter is not safe either if it's contained in one of the strings. Overall, I need to produce the same hash no matter the order of the strings. The arrays will be rather short (up to 10 items), but this will be called very often, so it shouldn't be too slow.

I intend to use it with the ES6 Map object, and I need to easily find the same array collection again.

Updated example of use

var theMap = new Map();
var lookup = function(arr) {
    var item = null;
    var hashed = hashStringArray(arr);
    if (item = theMap.get( hashed )) {
        return item;
    }
    // itemBasedOnInput stands for whatever value gets built from the input array
    theMap.set( hashed, itemBasedOnInput );
    return itemBasedOnInput;
};

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

lookup(arr1) === lookup(arr2)

Performance tests

http://jsperf.com/hashing-array-of-strings/5

FredyC
  • I could be wrong but doesn't ES6 Map support non-string keys? i.e. `theMap.set(arr1, "test")`. Still, you could use JSON.stringify(arr1) or create a "real" hash function that converts a string into a number... – Bart Aug 03 '14 at 12:11
  • Of course it does, but it compares them by reference, not by deep equality. You can try it on your own; an array with exactly the same elements (and order) will be seen as a different key. – FredyC Aug 03 '14 at 12:12
  • `JSON.stringify` could be an option, probably better than `join`, but I still need to make that sort. I doubt that can be avoided anyway. – FredyC Aug 03 '14 at 12:13
  • @FredyC I believe `join` would be more efficient than `JSON.stringify` – thefourtheye Aug 03 '14 at 12:14
  • @thefourtheye That's probably true, but as I said, there is a danger of the chosen delimiter appearing in one of the strings. JSON would take care of that. – FredyC Aug 03 '14 at 12:16
  • 2
    @FredyC Can you please give an example where the delimiter would be a problem? – thefourtheye Aug 03 '14 at 12:17
  • Would you mind converting your arrays into an MD5 hash string? It will result in a string that always has the same size. – Marco Bonelli Aug 03 '14 at 12:18
  • 2
    On the other hand, you could just use `arr.join("\0")`, it seems unlikely anyone will put a null character in a string... – Bart Aug 03 '14 at 12:19
  • @thefourtheye Hm, you are probably right, I was kinda overthinking this :) – FredyC Aug 03 '14 at 12:20
  • 3
    @thefourtheye If you use `|` as the delimiter, these two arrays would produce the same hash: `["abc", "def|ghi"]` and `["abc","def","ghi"]`. – Barmar Aug 03 '14 at 12:22
  • @MarcoBonelli I don't really care about the size of the string, but I don't see how MD5 is going to improve anything. It's most probably slower than `join`, not to mention the following lookup in the Map that could take longer. – FredyC Aug 03 '14 at 12:22
  • 1
    Also, there's no guarantee that MD5 will be unique, although collisions are unlikely. – Barmar Aug 03 '14 at 12:23
  • @Bart That sounds like interesting option too. – FredyC Aug 03 '14 at 12:23
  • @Barmar Ah, thanks. I couldn't think of that case :-) – thefourtheye Aug 03 '14 at 12:24
  • Ok, so based on the comments it seems there is not really a better way than this. Using "\0" is probably the most sensible option. If nobody has any other idea, I would probably give credit to @Bart if he writes a more concise answer below. – FredyC Aug 03 '14 at 12:27
  • What is the size of the set of possible strings? Or are they arbitrary strings? – GameAlchemist Aug 03 '14 at 12:48
  • ES6 maps do not accept hash codes; they will hash an array based on its reference. I recommend against it. – Benjamin Gruenbaum Aug 03 '14 at 13:08
  • @BenjaminGruenbaum I am using Map just for convenience because I like it more than plain objects. – FredyC Aug 03 '14 at 13:49
  • @GameAlchemist As I said, it's up to 10 in size, and the strings and their combinations will differ most of the time. – FredyC Aug 03 '14 at 13:50

4 Answers

5

Two things occurred to me as the basis of a solution:

  1. summing doesn't depend on order, which is actually a flaw in simple checksums (they don't catch changes in block order within a word), and

  2. we can convert strings to summable numbers using their charcodes

Here's a function to do (2):

var charsum = function(s) {
  var i, sum = 0;
  for (i = 0; i < s.length; i++) {
    sum += (s.charCodeAt(i) * (i+1));
  }
  return sum;
};

Here's a version of (1) that computes an array hash by summing the charsum values:

var array_hash = function(a) {
  var i, sum = 0;
  for (i = 0; i < a.length; i++) {
    var cs = charsum(a[i]);
    sum = sum + (65027 / cs);
  }
  return ("" + sum).slice(0, 16);
};

Fiddle here: http://jsfiddle.net/WS9dC/11/

If we did a straight sum of the charsum values, then the array ["a", "d"] would have the same hash as the array ["b", "c"] - leading to undesired collisions. So, based on non-UTF strings where charcodes go up to 255, and allowing for 255 characters in each string, the max return value of charsum is 255 * 255 = 65025. So I picked the next prime number up, 65027, and used (65027 / cs) to compute the hash. I am not 100% convinced this removes collisions... perhaps more thought is needed... but it certainly fixes the [a, d] versus [b, c] case. Testing:

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

console.log(array_hash(arr1))
console.log(array_hash(arr2))
console.log(array_hash(arr1) == array_hash(arr2))

Outputs:

443.5322979371356 
443.5322979371356
true 

And testing a case that shows different hashes:

var arr3 = ['a', 'd'];
var arr4 = ['b', 'c'];

console.log(array_hash(arr3))
console.log(array_hash(arr4))
console.log(array_hash(arr3) == array_hash(arr4))

Outputs:

1320.651443298969
1320.3792001649144
false 
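
For comparison, here is a minimal sketch of the naive straight-sum version described above (the naive_array_hash name is just for illustration), which reproduces the collision:

var naive_array_hash = function(a) {
  var i, sum = 0;
  for (i = 0; i < a.length; i++) {
    // A plain sum collides: charsum('a') + charsum('d') === charsum('b') + charsum('c')
    sum += charsum(a[i]);
  }
  return ("" + sum).slice(0, 16);
};

console.log(naive_array_hash(['a', 'd']) === naive_array_hash(['b', 'c'])); // true - collision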

Edit:

Here's a revised version that ignores duplicates from the arrays as it goes, and returns the hash based on unique items only:

http://jsfiddle.net/WS9dC/7/

var array_hash = function(a) {
  var i, sum = 0, product = 1;
  for (i = 0; i < a.length; i++) {
    var cs = charsum(a[i]);
    // Only count items whose charsum doesn't already divide the running
    // product - repeated strings are skipped.
    if (product % cs > 0) {
      product = product * cs;
      sum = sum + (65027 / cs);
    }
  }
  return ("" + sum).slice(0, 16);
};

testing:

var arr1 = ['alpha', 'beta', 'gama', 'delta', 'theta', 'alpha', 'gama'];
var arr2 = ["beta", "gama", "alpha", "theta", "delta", "beta"];

console.log(array_hash(arr1))
console.log(array_hash(arr2))
console.log(array_hash(arr1) === array_hash(arr2))

returns:

689.878503111701
689.878503111701
true 

Edit

I've revised the answer above to account for arrays of words that have the same letters. We need these to return different hashes, which they now do:

var arr1 = ['alpha', 'beta']
var arr2 = ['alhpa', 'ateb'] 

The fix was to add a multiplier to the charsum func based on the char index:

sum += (s.charCodeAt(i) * (i+1));
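
For example, with the index multiplier in place, anagram strings now produce different charsum values (values computed with the charsum function above):

console.log(charsum('alpha')); // 1550
console.log(charsum('alhpa')); // 1558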
sifriday
  • Could you please add this to that benchmark suite? http://jsperf.com/hashing-array-of-strings – FredyC Aug 03 '14 at 14:33
  • yep, will do - it will be interesting to work out how it compares. Also, I'd like to think of a good testwrap that looks for collisions, but this might be beyond me! It might actually be easier to think through the maths and look for something that might break it. – sifriday Aug 03 '14 at 14:34
  • Not that big a difference, but definitely faster. However, it doesn't really solve collisions... http://jsfiddle.net/WS9dC/4/ – FredyC Aug 03 '14 at 14:48
  • sorry if I am being slow, but the example in your fiddle looks OK - the arrays have different hashes and the comparison returns false, as desired? for me it returns: 975.5782767769929, 847.7110273835465, false – sifriday Aug 03 '14 at 14:50
  • Well, I had intentionally added duplicates there hoping it would filter them out as you had said (or I thought you meant that). Both arrays contain the group `'alpha', 'beta', 'gama', 'delta', 'theta'` when filtered and sorted. – FredyC Aug 03 '14 at 14:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/58572/discussion-between-fredyc-and-sifriday). – FredyC Aug 03 '14 at 14:53
  • I had found a weak spot of this. Since it's based on sum of characters, any word that has the same characters would be considered identical. That's quite bad :( ... http://jsfiddle.net/WS9dC/9/ – FredyC Aug 03 '14 at 18:10
  • 1
    better fix and test cases: http://jsfiddle.net/WS9dC/11/ ... the change was the addition line in charsum, which is now: sum += (s.charCodeAt(i) * (i+1)); – sifriday Aug 03 '14 at 18:56
1

If you calculate a numeric hash code for each string, you can combine them with an operator where the order doesn't matter, like the XOR operator (^), and then you don't need to sort the array:

function hashStringArray(array) {
  var code = 0;
  for (var i = 0; i < array.length; i++) {
    var n = 0;
    // Hash each string into a number.
    for (var j = 0; j < array[i].length; j++) {
      n = n * 251 ^ array[i].charCodeAt(j);
    }
    // XOR the string hashes together; XOR is commutative, so the order
    // of the strings doesn't matter.
    code ^= n;
  }
  return code;
}
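
A quick usage sketch (reusing the Map idea from the question; the cache name and stored value are just for illustration) - since XOR is order-independent, two arrays with the same strings produce the same numeric key:

var cache = new Map();

var arr1 = ['alpha', 'beta', 'gama'];
var arr2 = ['beta', 'alpha', 'gama'];

console.log(hashStringArray(arr1) === hashStringArray(arr2)); // true

cache.set(hashStringArray(arr1), 'some item');
console.log(cache.get(hashStringArray(arr2))); // 'some item'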
Guffa
  • Well, isn't that too slow? As I said, I expect this function to be called pretty often. I can hardly imagine it would be efficient to loop over these strings every time. – FredyC Aug 03 '14 at 13:51
  • @FredyC: Well, you would have to profile that to find out. Copying strings is also a way to loop through all data, and that also allocates memory, so it's not obvious which would be faster. – Guffa Aug 03 '14 at 13:54
  • Thanks for the effort @Guffa, I will give it a try when hunting for performance. It's definitely an interesting approach too. – FredyC Aug 03 '14 at 14:11
  • +1 for using charCodeAt to generate a numeric representation of the strings, then using an operator that doesn't care about ordering. However, Guffa - doesn't this lead to easy collisions? E.g. try it with var c = ['c', 'd', 'f']; var d = ['a', 'b', 'b']; – sifriday Aug 03 '14 at 14:19
  • @sifriday I am expecting to filter out duplicates before running this. I don't expect there is a way how to avoid that. – FredyC Aug 03 '14 at 14:24
  • Anyway, I gave it a try and made a benchmark. Frankly, I am surprised, because @Guffa's solution is really much faster... http://jsperf.com/hashing-array-of-strings – FredyC Aug 03 '14 at 14:25
  • I've just posted a suggestion below, using straightforward sum instead of XOR, and a prime number to try to reduce collisions. This is OK with duplicate data - I think - allowing for array entries of 255 chars each... – sifriday Aug 03 '14 at 14:27
  • In `n = n * 251 ^ array[i].charCodeAt(j); -> n = (n * 251) ^ array[i].charCodeAt(j);`, what is the purpose of `n * 251`? – tonix Oct 22 '19 at 19:12
  • Also, why using XOR instead of other commutative bitwise operators like OR? – tonix Oct 23 '19 at 06:40
0

You can do this:

var hashStringArray = function(array) {
    return array.sort().join('\u200b');
};

The \u200b character is the Unicode zero-width space. It is not the same as the \0 (null) character, which is the more widely used delimiter choice, and it is very unlikely to appear inside one of the strings.

'\u200b' == '\0'

> false
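
A quick check (using the arrays from the question) that reordered arrays still produce the same key:

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];

console.log(hashStringArray(arr1) === hashStringArray(arr2)); // true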
Marco Bonelli
0

An idea for a very fast hash, if your set of possible strings is less than 32 items long: hash each string with a function (with a built-in cache) that returns a power of two as the hash:

function getStringHash(aString) {
    var currentPO2 = 0;
    var hashSet = [];
    // Redefine the function with a closure that caches each string's hash
    // and assigns the next power of two to any string not seen before.
    getStringHash = function (aString) {
        var aHash = hashSet[aString];
        if (aHash) return aHash;
        aHash = 1 << currentPO2++;
        hashSet[aString] = aHash;
        return aHash;
    };
    return getStringHash(aString);
}

Then use this hash on your string array, ORing the hashes together (|):

function getStringArrayHash( aStringArray) {
    var aHash = 0;
    for (var i=0; i<aStringArray.length; i++) {
        aHash |= getStringHash(aStringArray[i]);
    }
    return aHash;
}

So to test a bit:

console.log(getStringHash('alpha'));  // 1
console.log(getStringHash('beta'));   // 2
console.log(getStringHash('gamma'));  // 4
console.log(getStringHash('alpha'));  // 1 again

var arr1 = ['alpha','beta','gama'];
var arr2 = ['beta','alpha','gama'];
var arr3 = ['alpha', 'teta'];

console.log(getStringArrayHash(arr1)); // 11
console.log(getStringArrayHash(arr2)); // 11 also, like for arr1

var arr3 = ['alpha', 'teta'];
console.log(getStringArrayHash(arr3)); // 17 : a different array gets a different hash

jsbin is here: http://jsbin.com/rozanufa/1/edit?js,console

Note: with this method, arrays are treated as sets, meaning that a repeated item won't change the hash of an array!

This has to be faster since it uses only 1) a function call, 2) a lookup and 3) integer arithmetic. So no sort, no (long) string building, no concatenation.

jsperf confirms that: http://jsperf.com/hashing-array-of-strings/4
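
A minimal sketch of how this integer hash could plug into the Map lookup from the question (the lookupByHash helper and the stored values are hypothetical):

var theMap = new Map();

function lookupByHash(arr, itemBasedOnInput) {
    var hashed = getStringArrayHash(arr);
    var item = theMap.get(hashed);
    if (item !== undefined) return item;
    theMap.set(hashed, itemBasedOnInput);
    return itemBasedOnInput;
}

console.log(lookupByHash(['alpha', 'beta', 'gama'], 'item-1')); // 'item-1'
console.log(lookupByHash(['beta', 'alpha', 'gama'], 'item-2')); // 'item-1' - same set of strings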


Edit:

Version with prime numbers, here: http://jsbin.com/rozanufa/3/edit?js,console

// Return the unique prime associated with the string.
function getPrimeStringHash(aString) {
    var hashSet = [];
    var currentPrimeIndex = 0;
    var primes = [2, 3, 5, 7, 11, 13, 17];
    // Redefine the function with a closure that caches each string's prime
    // and assigns the next unused prime to any string not seen before.
    getPrimeStringHash = function (aString) {
        var aPrime = hashSet[aString];
        if (aPrime) return aPrime;
        if (currentPrimeIndex == primes.length) aPrime = getNextPrime();
        else aPrime = primes[currentPrimeIndex];
        currentPrimeIndex++;
        hashSet[aString] = aPrime;
        return aPrime;
    };
    return getPrimeStringHash(aString);
    // Compute the next prime number, store it and return it.
    function getNextPrime() {
        var pr = primes[primes.length - 1];
        do {
            pr += 2;
            var divides = false;
            // Discard the number if it divides by one of the earlier primes.
            for (var i = 0; i < primes.length; i++) {
                if ((pr % primes[i]) == 0) {
                    divides = true;
                    break;
                }
            }
        } while (divides == true);
        primes.push(pr);
        return pr;
    }
}

function getStringPrimeArrayHash(aStringArray) {
    var primeMul = 1;
    for (var i = 0; i < aStringArray.length; i++) {
        primeMul *= getPrimeStringHash(aStringArray[i]);
    }
    return primeMul;
}

function compareByPrimeHash(aStringArray, anotherStringArray) {
    var mul1 = getStringPrimeArrayHash(aStringArray);
    var mul2 = getStringPrimeArrayHash(anotherStringArray);
    return (mul1 > mul2) ? !(mul1 % mul2) : !(mul2 % mul1);
    // Note: just test mul1 == mul2 if you are sure there are no duplicates.
}

Tests:

console.log(getPrimeStringHash('alpha'));  // 2
console.log(getPrimeStringHash('beta'));   // 3
console.log(getPrimeStringHash('gamma'));  // 5
console.log(getPrimeStringHash('alpha'));  // 2 again
console.log(getPrimeStringHash('a1'));  // 7 
console.log(getPrimeStringHash('a2'));  // 11


var arr1 = ['alpha','beta','gamma'];
var arr2 = ['beta','alpha','gamma'];
var arr3 = ['alpha', 'teta'];
var arr4 = ['alpha','beta','gamma', 'alpha']; // == arr1 + duplicate 'alpha'

console.log(getStringPrimeArrayHash(arr1)); // 30
console.log(getStringPrimeArrayHash(arr2)); // 30 also, like for arr1

var arr3 = ['alpha', 'teta'];
console.log(getStringPrimeArrayHash(arr3)); // 26 : a different array gets a different hash

console.log(compareByPrimeHash(arr1, arr2) ); // true
console.log(compareByPrimeHash(arr1, arr3) ); // false
console.log(compareByPrimeHash(arr1, arr4) ); // true despite duplicate
GameAlchemist
  • I haven't inspected it much, but would you please add this to the benchmark test so we have a comparison with @sifriday's solution? – FredyC Aug 03 '14 at 15:37
  • @FredyC: this is indeed much faster, see the jsperf test. – GameAlchemist Aug 03 '14 at 16:01
  • Well, that definitely looks even faster and solves duplicates; however, you misunderstood the intentional slow-down of the first two tests. This way the filtering is not accounted for in the benchmark. Also, your solution is not using the lookup from the test suite. I have updated it now ... http://jsperf.com/hashing-array-of-strings/5 ... It still beats everything else, awesome! – FredyC Aug 03 '14 at 16:35
  • Hmm, I just realized I cannot use this solution. There will be definitely more than 32 strings in total :( – FredyC Aug 03 '14 at 17:55
  • No issue with that: use prime numbers instead of powers of two and multiply instead of ORing. Two hashes are equal if their modulo is == 0. – GameAlchemist Aug 03 '14 at 18:07
  • I could use some help with that. I don't even know how to find prime numbers programmatically :( – FredyC Aug 03 '14 at 18:12
  • Ok, so I have figured out prime numbers and I understand the multiplying. I am not sure about the last part. Why do I need to run modulo? Isn't it enough to just compare those two numbers for equality? – FredyC Aug 03 '14 at 18:42
  • See my edit. If you're sure there are no duplicates, just use ==. If there may be duplicates, you must test for ( max(a,b) % min(a,b) ) == 0 to be sure. – GameAlchemist Aug 03 '14 at 19:07
  • Ok thank you, I was on the similar route with those prime numbers. I will try to implement this into my app. – FredyC Aug 03 '14 at 19:44
  • I have made small test to find out how many prime numbers I can use in case all of them would be multiplied. I must say it's not much, just 132. http://jsfiddle.net/9CV8Z/, but that's really edge case. – FredyC Aug 04 '14 at 07:10
  • After taking only last 10 digits from primes list (which is more real for my case) it's much more satisfactory. In fact I had no patience to wait for the end. It will be a lot of numbers for sure. http://jsfiddle.net/9CV8Z/1/ – FredyC Aug 04 '14 at 07:14