8

I'm looking for an efficient hash function for the Rabin-Karp algorithm. Here is my current code (in C).

static bool f2(char const *const s1, size_t const n1,
               char const *const s2, size_t const n2)
{
    uintmax_t hsub = hash(s2, n2);   /* hash of the pattern                  */
    uintmax_t hs   = hash(s1, n2);   /* hash of the first window of the text */
    size_t    nmax = n1 - n2;        /* index of the last window (n1 >= n2)  */

    for (size_t i = 0; i <= nmax; ++i) {
        if (hs == hsub) {
            if (strncmp(&s1[i], s2, n2) == 0)
                return true;
        }
        if (i < nmax)
            hs = hash(&s1[i + 1], n2);   /* rehash the next window */
    }
    return false;
}

I have looked at several Rabin-Karp implementations in C, but they all differ in the hash function they use. So my question is: what are the characteristics that a Rabin-Karp hash function should have?

templatetypedef
md5
  • Have you already seen [this](http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm#Hash_function_used)? – Gigi Jul 18 '12 at 17:18
  • Rabin-Karp cannot use just any hash function, it requires a specialised hash function that can be quickly calculated for a position i from the already known value for position (i-1). – rossum Jul 18 '12 at 20:59
  • Yes, @Gigi, I have. But if there were a slightly better hash function, it would be perfect (because I will run this function many times). @rossum: Following the Wikipedia article, I did write a `rehash` function. – md5 Jul 19 '12 at 08:25
  • Have you read this [answer](http://stackoverflow.com/a/10339017/986743)? – zenpoy Jul 19 '12 at 14:36
  • @md5, maybe md5, anyway just joke :) – nurgasemetey Aug 07 '14 at 14:11

4 Answers

10

An extremely well-performing hash is the Bernstein hash. It even outperforms many popular hashing algorithms.

/* Bernstein (djb2-style) hash: h = 33 * h + byte, for each byte of the key */
unsigned bernstein_hash ( void *key, int len )
{
    unsigned char *p = key;
    unsigned h = 0;
    int i;

    for ( i = 0; i < len; i++ )
        h = 33 * h + p[i];

    return h;
}

Of course, you can try out other hashing algorithms, as described here: Hash function on NIST

Note: it has never been fully explained why 33 performs so much better than other, seemingly more logical constants.

For your interest: Here is a good comparison of different hash algorithms: strchr comparison of hash algorithms

Mare Infinitus
  • Is it OK that "unsigned" overflows in the arithmetic operation? – Pupsik May 09 '17 at 07:48
  • @Pupsik Overflows are in general not an issue, so I'd say yes. – devoured elysium Mar 25 '18 at 23:33
  • How come this is the accepted and a highly voted answer? Its time complexity is O(p). If this gets called for every window of the main text, the pattern search function will have time complexity O(p*t), which is the same as the brute-force approach. The question is about a hash function that works well with the Rabin-Karp algorithm. This function doesn't. – Saptarshi Basu Nov 08 '18 at 06:14
  • But the Rabin-Karp algorithm implies the use of a rolling hash function. A rolling hash function has a special property: if we already know the value of H(c[0..n]), we can compute H(c[1..n+1]) *quickly*. This is the property of a rolling hash function which the Bernstein hash doesn't have! I think we should downvote this answer! – Kirill Frolov Mar 13 '19 at 11:39
  • I'm not sure why comments claim this hash function can't be efficiently rolled. It's equivalent (modulo `sizeof(unsigned)`) to the pseudo-representation of a string in base 33. If your window has length `k`, you can "remove" `str[i - k]` from the hash by subtracting the hash by `pow(33, k - 1) * str[i-k]`. i.e. to roll the hash by one, you do `hash = 33 * (hash - pow(33, k -1) * str[i-k]) + str[i]`. – Navneeth Jun 01 '20 at 23:08
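
A minimal C sketch of the rolling update described in the last comment (an illustration only, assuming a fixed window length k, with pow33 precomputed as 33^(k-1); unsigned arithmetic wraps modulo 2^32, which is the modulus this hash already uses implicitly):

/* Roll the Bernstein hash of a k-byte window by one position:
   "out" is the byte leaving the window, "in" is the byte entering it.
   pow33 must hold 33^(k-1), computed once (also with wrapping). */
static unsigned bernstein_roll(unsigned h, unsigned pow33,
                               unsigned char out, unsigned char in)
{
    return 33 * (h - pow33 * out) + in;
}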
2

what are the characteristics that a Rabin-Karp hash function should have?

Rabin-Karp needs a rolling hash. The easiest rolling hash is a moving sum. Adler-32 and Buzhash are pretty simple too and perform better than a moving sum.

Any of these rolling hash techniques should work for Rabin-Karp (a minimal moving-sum sketch follows the list):

  • Moving sum
    • remove the oldest byte with subtraction
    • add a new byte with addition
  • Polynomial rolling hash
    • remove the oldest byte with subtraction
    • add a new byte with multiplication and addition
  • Rabin fingerprint
    • a polynomial rolling hash whose polynomial is irreducible over GF(2)
  • Tabulation hashing
    • remove the oldest byte with a table lookup and an xor
    • add a new byte with a table lookup and an xor
  • Cyclic polynomial, aka Buzhash
    • tabulation hashing based on circular shifts
  • Adler-32 checksum
    • not a rolling checksum by default but easily adjusted to "roll"
    • remove the oldest byte with two subtractions
    • add a new byte with two additions
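
For concreteness, here is a minimal moving-sum sketch (an illustration, not part of any particular implementation), assuming a window of fixed length n:

#include <stddef.h>  /* size_t */

/* Moving-sum rolling hash: the window hash is just the sum of its bytes. */
static unsigned moving_sum_init(unsigned char const *p, size_t n)
{
    unsigned h = 0;
    for (size_t i = 0; i < n; i++)
        h += p[i];                   /* sum of the first n bytes */
    return h;
}

/* Roll by one position: remove the oldest byte, add the newest. */
static unsigned moving_sum_roll(unsigned h, unsigned char out, unsigned char in)
{
    return h - out + in;
}

Rolling costs two additions per position regardless of the window length, which is exactly the property Rabin-Karp needs.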
mndrix
0

For problems with small alphabets, such as nucleic acid sequence searching (e.g. alphabet = {A, T, C, G, U}), ntHash may be a good hash function. It uses binary operations, which are fast, supports rolling updates, and gives uniformly distributed hash values.

Chris Tang
0

Considering that the implementors of Java's JDK would have given this some thought, I looked up which hash function is used there.

As of ~ Java 19, https://github.com/openjdk/jdk/blob/jdk-19+23/src/java.base/share/classes/java/lang/String.java#L2326

The update function is:

h' = 31 * h + c

The initial value is 0.
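
For comparison with the f2 function from the question, here is a minimal sketch (an illustration, not JDK code) of a Rabin-Karp search that treats this h' = 31 * h + c update as a rolling window hash; the character leaving the window is removed with a precomputed 31^(n2-1), and unsigned arithmetic wraps modulo 2^32:

#include <stdbool.h>
#include <string.h>

/* Rabin-Karp search using a base-31 rolling hash.
   s1/n1 is the text, s2/n2 the pattern; returns true if s2 occurs in s1. */
static bool rk_search(char const *s1, size_t n1, char const *s2, size_t n2)
{
    if (n2 == 0 || n2 > n1)
        return n2 == 0;

    unsigned hsub = 0, hs = 0, msb = 1;   /* msb = 31^(n2-1), wraps mod 2^32 */
    for (size_t i = 0; i < n2; i++) {
        hsub = 31 * hsub + (unsigned char)s2[i];
        hs   = 31 * hs   + (unsigned char)s1[i];
        if (i > 0)
            msb *= 31;
    }

    for (size_t i = 0; ; i++) {
        if (hs == hsub && memcmp(&s1[i], s2, n2) == 0)
            return true;
        if (i + n2 == n1)
            return false;
        /* roll: drop s1[i], append s1[i + n2] */
        hs = 31 * (hs - msb * (unsigned char)s1[i]) + (unsigned char)s1[i + n2];
    }
}

Each step updates the window hash in O(1) instead of rehashing the whole window, which is the point of using a rolling hash in Rabin-Karp.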