2

I wrote a function that is given 2 strings, s and a and it is supposed to return position of the first occurrence of a in s. It works fine with a as one character, but otherwise it stops working if the first occurrence is after 3rd character in s.

I've already checked if mul and add work, if hash of a is correct and I've reduced bases to 10 and 100 (which are not very good for hashing cause they're not prime) and it worked (on strings of length of 20). This might mean that modulo doesn't work as expected.

function getIndex(s, a) {

    // 2 bases and 2 mods to reduce the number of collisions

    var base1 = 31337;
    var base2 = 31357;

    var mod1 = 1e9 + 7;
    var mod2 = 1e9 + 9;

    //suffix arrays

    var hs1 = new Uint32Array(s.length);
    var hs2 = new Uint32Array(s.length);

    // bases to the power of a.length

    var ba1 = 1;
    var ba2 = 1;

    // hashes for a

    var ha1 = 0;
    var ha2 = 0;

    //operators

    var mul = (x, y, mod) => (x * y) % mod;

    var add = (x, y, mod) => {
        x += y;
        if(x >= mod) x -= mod;
        if(x < 0) x += mod;
        return x;
    }

    //get hash of a and find value of ba1 and ba2

    for(var i = 0; i < a.length; i++) {
        ha1 = add(mul(ha1, base1, mod1), a.charCodeAt(i), mod1);
        ha2 = add(mul(ha2, base2, mod2), a.charCodeAt(i), mod2);

        ba1 = mul(ba1, base1, mod1);
        ba2 = mul(ba2, base2, mod2);
    }

    //make suffix array

    var h1 = 0;
    var h2 = 0;

    for(var i = 0; i < s.length; i++) {
        h1 = add(mul(h1, base1, mod1), s.charCodeAt(i), mod1);
        h2 = add(mul(h2, base2, mod2), s.charCodeAt(i), mod2);

        hs1[i] = h1;
        hs2[i] = h2;
    }

    //Compare hashes of substrings of s (by removing prefix from the current element) with hash of a

    for(var i = a.length - 1; i < s.length; i++) {
        var h1 = hs1[i];
        var h2 = hs2[i];
        if(i >= a.length) {
            console.log(i, i - a.length, h1);
            h1 = add(h1, -mul(hs1[i - a.length], ba1, mod1), mod1);
            h2 = add(h2, -mul(hs2[i - a.length], ba2, mod2), mod2);
        }
        if(h1 == ha1 && h2 == ha2) return i - a.length + 1;
    }

    return -1;
}
getIndex("abcdefgh", "f") //returns 5
getIndex("abcdefgh", "fg")//returns -1
Jaydeep
  • 1,686
  • 1
  • 16
  • 29
Fran x1024
  • 41
  • 5
  • Can you please explain what is the purpose of your hash function? What does it have to do with the task to retrieve the first index where a substring occurs in another string? This could be simply done with `s.indexOf(a)` as you might know. – SparkFountain Sep 05 '19 at 14:18
  • Cause in this example, the complexity of indexOf would be (n - m) * m (or simplified O(n * m) ), but with hashing (in this code it's purpose is to turn strings to number) and suffix array I think the complexity will be O(n) (1 for loop from 0 to m and 2 for loops from 0 to n). – Fran x1024 Sep 05 '19 at 21:04
  • I haven't explained what was it's purpose, using hashing I make a suffix array of numbers which represent suffixes of the string. Due to the properties of this hash function (if I take `(x - len)`th elem from suffix array, multiply it by base to power of `len` and subtract from `x`th element, I'll get the hash of substring from `(x - len)`th element to `x`th element) I can get hash of every substring very fast and compare it to the hash of the string I am looking for (which is comparing numbers and should also be pretty fast). – Fran x1024 Sep 05 '19 at 21:14
  • Ok, so you want to compare *hashed* strings. I thought that your function should still compare *plain text* strings. – SparkFountain Sep 06 '19 at 05:37
  • Your "suffix array" seems to store the hashes of prefixes? – Bergi Sep 08 '19 at 01:54

1 Answers1

1

With a bit more logging, it becomes clear that you're victim to floating point inaccuracies. JS numbers can only represent 52 bit of integers precisely.

In getIndex("abcdefgh", "fg"), the searched hash ha1 is 3196477, your base multiplier ba1 is 982007569, and in the sixth iteration the prefix hashes are hs1[6] = 73644174 and hs1[4] = 800389532. If you put those in your arithmetic function, the result is 3196441, off by 6. This is because 800389532 * 982007569 is about two orders of magnitude larger than the maximum safe integer and in JS does evaluate to 785988578572367700, which is clearly wrong.

You would need to either choose your modulus and base smaller, or use bigints for all your calculations.

Bergi
  • 630,263
  • 148
  • 957
  • 1,375
  • I've changed line 28 to `var mul = (x, y, mod) => Number((BigInt(x) * BigInt(y)) % BigInt(mod));` and now it works, but I don't know how fast BigInt is. I found out that it's size is only limited by memory, so I am not sure about how fast it is (mathematical operations and conversion). – Fran x1024 Sep 08 '19 at 16:57
  • I also don't think it would be smart to reduce bases and mods cause it would make space for more collisions, matching hashes of strings that are not same. – Fran x1024 Sep 08 '19 at 16:58
  • @Franx1024 You have to deal with collisions anyway, they're always bound to happen. When finding a *possible* solution with the hashes, you always need to confirm it by actually comparing the strings. – Bergi Sep 08 '19 at 17:07
  • @Franx1024 Regarding performance, bigints a probably quite slow. They would however allow you to use only a single, big hash instead of two small hashes with different bases and modulos, so that you can get rid of all that code duplication. And yes, then you'll have to benchmark against `indexOf`. I'm pretty certain that JS engines did implement [a string search algorithm](https://en.wikipedia.org/wiki/String-searching_algorithm) that does not have the naive quadratic runtime. – Bergi Sep 08 '19 at 17:19
  • I found this O(nm) complexity of indexof here https://stackoverflow.com/questions/12752274/java-indexofstring-str-method-complexity , but when I took a look at source code ( https://chromium.googlesource.com/v8/v8/+/4.3.49/src/string.js?autodive=0%2F%2F ), it seemed like they're using some sort of suffix array too. – Fran x1024 Sep 09 '19 at 16:57
  • @Franx1024 That `O(nm)` is the worst-case complexity of the *Java* implementation – Bergi Sep 09 '19 at 18:58
  • I didn't notice it was a Java post. – Fran x1024 Sep 13 '19 at 14:57
  • Than this function is pretty much useless. Also, I know O(nm) is the worst complexity, this would happen only when the occurrence is at the end of the string and if the rest of the string is filled with substrings which do not match only at the last character. – Fran x1024 Sep 13 '19 at 14:58
  • (by this function I mean my replacement for indexOf) – Fran x1024 Sep 13 '19 at 14:59