13

I'm trying to hash a string into an integer for placing it in an array. However I do not know all too much about hashing functions, and that's why my current method is just adding all the ASCII numbers of the characters together and taking it mod the array size.

Are there any simple faster/better methods?

Little Helper
  • 2,419
  • 9
  • 37
  • 67
Hal
  • 181
  • 1
  • 2
  • 4

6 Answers6

13

The FNV-1a hash is quick and easy to implement.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • It's quick, easy and terrible. Its avalanche characteristics are unacceptable. If you must use an FNV variant, FNV-1a is much, much better in this regard. Or, better yet, don't use FNV at all. There are so many better alternatives that this one is a waste of time. – Steven Sudit Sep 11 '10 at 10:38
  • +1. FNV Wikipedia entry beats that one of the Jenkins. And language agnostic. – Dummy00001 Sep 11 '10 at 11:11
  • @Dummy: Only because they omitted the avalanche results. It turns out that FNV-1a doesn't suck, but FNV does. For that matter, Jenkins' lookup3 is ok, whereas his one-at-a-time sucks. – Steven Sudit Sep 11 '10 at 11:15
  • @Greg: I doubt you care about your rep score any more than I do, but if you edit your answer to specify FNV-1a, then I'l be glad to remove the downvote. I still don't like this FNV variant, but at least it's not terrible. – Steven Sudit Sep 11 '10 at 12:03
  • @Steven Sudit: That's fair, changed the link. – Greg Hewgill Sep 11 '10 at 12:08
  • Done. If I'm reading the chart at http://www.strchr.com/hash_functions right, it looks like FNV-1a is optimized for longer inputs, so it would not do well with short strings. – Steven Sudit Sep 11 '10 at 12:21
  • 3
    [Comparing FNV-1, FNV-1a, DJB2, DJB2a, SDBM, CRC32](http://programmers.stackexchange.com/questions/49550/what-hashing-algorithm-is-good-for-uniqueness-and-speed/145633#145633). Of those **FNV-1a** is the best. – Ian Boyd Apr 26 '12 at 17:17
  • @IanBoyd That was very helpful. I'm glad you added Murmur2 and I'm not surprised that it did so well, but I'm wondering if you've tried Murmur3. – Steven Sudit Apr 27 '12 at 23:05
  • @StevenSudit It's a coincidence that you mention it. i've been looking at it for the past hour. i've translated it into Delphi, but there's no test vectors out there, so i can't be sure i have it right (it's very endian sensitive; and who knows what order the original C++ is doing things in). – Ian Boyd Apr 27 '12 at 23:25
  • @IanBoyd I suspect you can find sample vectors, or just ask the author. – Steven Sudit May 08 '12 at 00:44
  • @StevenSudit There's been an open issue on the project homepage since 2009 asking for test vectors. – Ian Boyd May 08 '12 at 00:50
  • @StevenSudit No, but [you're free to harass him if you like](http://code.google.com/p/smhasher/issues/detail?id=6). – Ian Boyd May 08 '12 at 21:05
  • @IanBoyd Ok, I dropped him a note, using our employer emails. – Steven Sudit May 09 '12 at 22:30
  • @StevenSudit i would't worry about it too much. i'm confident my implementation is correct. But even if it's not, the MurmurHash3 performance is lower than that of MurmurHash2. i even went down to `asm`, calling `rol` instruction rather than the algorithm's own implementation of "rotate". Given that it's slower, and it's lack of test vectors, i've not pursued it further - deciding on `MurmurHash2` is *"the best for my hash table purposes"*. – Ian Boyd May 10 '12 at 16:43
  • @IanBoyd Austin responded on smhasher with: 'SMHasher uses the VerificationTest function to check if a hash function is correctly implemented - it doesn't rely on correct capitalization or punctuation of "The quick brown fox..." and catches more implementation errors than a single test vector can. I'll make a note on the front page to point it out.' – Steven Sudit May 11 '12 at 18:33
  • @IanBoyd On a side note, depending on the compiler, you can get C/C++ to emit a ROL instruction for you without messing with assembler directly. – Steven Sudit May 11 '12 at 18:34
  • I guess you tested a pascal slow version of crc32. crc32 could be fast, faster than most others, if used with 8KB of lookup tables and asm, and even much faster than alternative if your CPU has SSE4.2 hardware instructions of it. See http://blog.synopse.info/post/2014/05/25/New-crc32c%28%29-function-using-optimized-asm-and-SSE-4.2-instruction About hash distribution, crc32 is very good - see https://www.delphitools.info/2014/08/25/string-hashing-shootout/ You may correct your answer about crc32. – Arnaud Bouchez Jun 19 '15 at 18:00
4

See http://www.strchr.com/hash_functions for a very good panel of hashing functions.

In Delphi implementation, here are several versions:

The first coming to mind is the one used in TStringHash.HashOf method from official IniFiles.pas unit. Including a faster asm version:

function HashOf(P: PByteArray; Len: integer): cardinal;
// algorithm from IniFiles.TStringHash.HashOf
{$ifdef PUREPASCAL}
var I: Integer;
begin
  Result := 0;
  for I := 1 to Len do
    Result := ((Result shl 2) or (Result shr (SizeOf(Result)*8-2))) xor P[I];
end;
{$else}
asm // faster asm version by Synopse
    or edx,edx
    jz @z
    push ebx
    mov ebx,edx     // ebx = length(Key)
    mov edx,eax     // edx = Text
    xor eax,eax     // eax = Result
    xor ecx,ecx     // ecx = Result shl 2 = 0
@1: shr eax,$1e     // eax = Result shr (SizeOf(Result) * 8 - 2))
    or ecx,eax      // ecx = ((Result shl 2) or (Result shr (SizeOf(Result)*8-2)))
    movzx eax,byte ptr [edx] // eax = ord(Key[i])
    inc edx
    xor eax,ecx     // eax = () xor ord(Key[i])
    dec ebx
    lea ecx,[eax*4] // ecx = Result shl 2
    jnz @1
    pop ebx
@z:
end;
{$endif}

The classic Kernighan & Ritchie hash from "The C programming Language", 3rd edition - not the best, but simple and efficient code.

function kr32(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
var i: integer;
begin
  for i := 0 to len-1 do
    crc := ord(buf[i])+crc*31;
  result := crc;
end;

The fast "Adler" CRC as implemented in zlib - optimized asm version here:

function Adler32Pas(Adler: cardinal; p: pointer; Count: Integer): cardinal;
var s1, s2: cardinal;
    i, n: integer;
begin
  s1 := LongRec(Adler).Lo;
  s2 := LongRec(Adler).Hi;
  while Count>0 do begin
    if Count<5552 then
      n := Count else
      n := 5552;
    for i := 1 to n do begin
      inc(s1,pByte(p)^);
      inc(cardinal(p));
      inc(s2,s1);
    end;
    s1 := s1 mod 65521;
    s2 := s2 mod 65521;
    dec(Count,n);
  end;
  result := word(s1)+cardinal(word(s2)) shl 16;
end;

My own faster variant - not re-entrant, but faster since it will read by DWORDs - and an even faster asm version here:

function Hash32(Data: pointer; Len: integer): cardinal;
function SubHash(P: PCardinalArray; L: integer): cardinal;
{$ifdef HASINLINE}inline;{$endif}
var s1,s2: cardinal;
    i: PtrInt;
const Mask: array[0..3] of cardinal = (0,$ff,$ffff,$ffffff);
begin
  if P<>nil then begin
    s1 := 0;
    s2 := 0;
    for i := 1 to L shr 4 do begin // 16 bytes (4 DWORD) by loop - aligned read
      inc(s1,P^[0]);
      inc(s2,s1);
      inc(s1,P^[1]);
      inc(s2,s1);
      inc(s1,P^[2]);
      inc(s2,s1);
      inc(s1,P^[3]);
      inc(s2,s1);
      inc(PtrUInt(P),16);
    end;
    for i := 1 to (L shr 2)and 3 do begin // 4 bytes (DWORD) by loop
      inc(s1,P^[0]);
      inc(s2,s1);
      inc(PtrUInt(P),4);
    end;
    inc(s1,P^[0] and Mask[L and 3]);      // remaining 0..3 bytes
    inc(s2,s1);
    result := s1 xor (s2 shl 16);
  end else
    result := 0;
end;
begin // use a sub function for better code generation under Delphi
  result := SubHash(Data,Len);
end;

The classic CRC32 version - you can find a very optimized asm version (using 8 tables) here:

function UpdateCrc32(aCRC32: cardinal; inBuf: pointer; inLen: integer) : cardinal;
var i: integer;
begin
  result := aCRC32;
  // if we used a dynamic table, we assume we want shorter code size
  for i := 1 to inLen do begin
    result := crc32Tab[byte(result xor pByte(inBuf)^)] xor (result shr 8);
    inc(cardinal(inBuf));
  end;
end;
Arnaud Bouchez
  • 42,305
  • 3
  • 71
  • 159
  • 2
    Delphi 2009 and up also provides Jenkins Hash (used to hash dictionary entries) in [Generics.Defaults.BobJenkinsHash](http://docwiki.embarcadero.com/VCL/de/Generics.Defaults.BobJenkinsHash) – mjn Nov 25 '11 at 07:18
  • About a comparison of hash functions, take a look at http://www.delphitools.info/2014/08/25/string-hashing-shootout/ In short: crc32 is the best balance between performance and collisions, kr32 has a lot of collisions, and BobJenkinsHash is dead slow. – Arnaud Bouchez Dec 19 '14 at 10:15
3

As Dummy00001 pointed out, this has been asked and answered before. Take a look at Best algorithm for hashing number values?, particularly the suggestion of using MurmurHash.

I'd recommend MurmurHash because:

  1. It's very fast.

  2. Its distribution and avalanche characteristics are excellent for a non-cryptographic hash.

  3. Its worst-case behavior is still pretty good.

I've used it. It doesn't suck.

edit

There was a lot of discussion about how to best port it to Delphi, on https://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0. The resulting code is available at https://forums.codegear.com/thread.jspa?threadID=14879

Delphi translation

function Murmur2(const S: AnsiString; const Seed: LongWord=$9747b28c): LongWord;
var
    h: LongWord;
    len: LongWord;
    k: LongWord;
    data: Integer;
const
    // 'm' and 'r' are mixing constants generated offline.
    // They're not really 'magic', they just happen to work well.
    m = $5bd1e995;
    r = 24;
begin
    len := Length(S);

    //The default seed, $9747b28c, is from the original C library

    // Initialize the hash to a 'random' value
    h := seed xor len;

    // Mix 4 bytes at a time into the hash
    data := 1;

    while(len >= 4) do
    begin
        k := PLongWord(@S[data])^;

        k := k*m;
        k := k xor (k shr r);
        k := k* m;

        h := h*m;
        h := h xor k;

        data := data+4;
        len := len-4;
    end;

    {   Handle the last few bytes of the input array
            S: ... $69 $18 $2f
    }
    Assert(len <= 3);
    if len = 3 then
        h := h xor (LongWord(s[data+2]) shl 16);
    if len >= 2 then
        h := h xor (LongWord(s[data+1]) shl 8);
    if len >= 1 then
    begin
        h := h xor (LongWord(s[data]));
        h := h * m;
    end;

    // Do a few final mixes of the hash to ensure the last few
    // bytes are well-incorporated.
    h := h xor (h shr 13);
    h := h * m;
    h := h xor (h shr 15);

    Result := h;
end;

Passes all self-tests from the original C implementation.

Community
  • 1
  • 1
Steven Sudit
  • 19,391
  • 1
  • 51
  • 53
  • Any pointers to proof that the hash function doesn't suffer from the avalanche effect you complain elsewhere so much? Murmur uses pretty much the same algorithm, only slightly differently. – Dummy00001 Sep 11 '10 at 10:55
  • 1
    I answered this elsewhere, but I'll repeat it here: http://sites.google.com/site/murmurhash/avalanche – Steven Sudit Sep 11 '10 at 11:11
  • Some people might think that, since the source is the canonical site for MurmurHash, it must be suspect. Those people should read more carefully, and follow the link to http://home.comcast.net/~bretm/hash/ – Steven Sudit Sep 11 '10 at 11:19
  • @Steven: regarding the link - 404 errors when trying to look into at the diagrams at the link. Nor you have provided any link to the sample implementation of advertised algorithm. P.S. the second link (~bretm) shows different avalanche diagram for Jenkins and conclusion is also nice: "This hash function by Bob Jenkins should be suitable for general purpose use, either for hash table lookup, basic file fingerprinting, or other non-cryptographic uses." You are very *trustworthy* source of information ;) – Dummy00001 Sep 11 '10 at 11:32
  • @Dummy: Pay attention: Mulvey’s diagram was for lookup3, not one-at-a-time. – Steven Sudit Sep 11 '10 at 11:38
  • @Steven: Pay attention, I linked to the Wikipedia article which links to several implementation - and plethora of other useful information. Remove down-vote from my answer, re-edit your own answer into something more up-vote-worthy (don't you know how to post links here?), and I would remove my down-vote. – Dummy00001 Sep 11 '10 at 11:55
  • A more comprehensive survey of hash functions, focusing on collision-avoidance and speed, can be found at http://www.strchr.com/hash_functions. – Steven Sudit Sep 11 '10 at 12:00
  • Edit your answer so that it clearly recommends *only* lookup3, and I'll remove the downvote. I will still suggest that lookup3 be avoided, since MurmurHash is at least as good in terms of distribution and avalanche, and yet it's faster. – Steven Sudit Sep 11 '10 at 12:02
  • It turns out the 404 errors are due to the fact that the author is in the middle of moving the site to http://tanjent.com/doku.php?id=murmurhash – Steven Sudit Sep 11 '10 at 12:22
  • Besides http://programmers.stackexchange.com/questions/49550/what-hashing-algorithm-is-good-for-uniqueness-and-speed/145633#145633, I'd like to mention Murmur3. – Steven Sudit Apr 27 '12 at 23:06
  • As soon [as i find some test vectors :)](http://code.google.com/p/smhasher/issues/detail?id=6) – Ian Boyd Apr 27 '12 at 23:39
  • Isn't there a bug in the above Pascal implementation where it loads the last 3 bytes? MurmurHash2.cpp uses switch(len), so the Pascal should use Case len of... If 3 bytes remain, this code will load 3, then load 2, then load 1. – Guy Gordon Aug 22 '12 at 13:18
2

Jenkins hash function should help you get started.

my current method is just adding all the ASCII numbers of the characters together and taking it mod the array size.

You discard important bit of information which is the position of the character in the string. That is a bad idea, since then strings "AB" and "BA" would have same the same hash value.

Instead of simple addition, keeping it primitive, one can use expression like hash = hash*P1 + str[i]*P2 + P3; where Pi are some prime numbers. That's how I do it if I need a hash function quickly. I often use 7, 5 and 3 as the primes, but the numbers should be obviously adjusted (as well as initial value of hash) so that the result of hash function is usable to your task.

For more information read the corresponding (and rather informative) Wikipedia article.

Dummy00001
  • 16,630
  • 5
  • 41
  • 63
  • Sure, start there, but then keep moving. As the article you linked to shows, it has pretty awful avalanche characteristics. Just look at all the yellow pixels. – Steven Sudit Sep 11 '10 at 10:39
  • @Steven: The yellow dots are really few there. And it is definitely better than simple addition. And it is fast and works fine on real world strings. I use it in several places and it definitely improved key distribution. – Dummy00001 Sep 11 '10 at 10:50
  • I don't deny for a moment that it's better than addition -- I even pointed out the AB/BA problem in a comment to an answer that was deleted. However, there's no reason there should be *any* yellow pixels. Projects like memcachedb use Murmur for a good reason! – Steven Sudit Sep 11 '10 at 10:53
  • Don't be silly. There would be always yellow pixels - because one still has to apply modulo to the result of the hash function and that would discard more useful bits of information. And murmur doesn't strike to be much different from Jenkins or FNV. Or you haven't even bothered to check how Murmur works? I think you're being over-pedantic here. – Dummy00001 Sep 11 '10 at 10:57
  • Oh I'm *definitely* pedantic, but I'm also completely right. It's not radically different, but it's designed to avoid this particular flaw. For a good visual comparison, look at http://sites.google.com/site/murmurhash/avalanche – Steven Sudit Sep 11 '10 at 11:10
  • From the one of the [linked articles](http://home.comcast.net/~bretm/hash/7.html): "This hash function by Bob Jenkins should be suitable for general purpose use, either for hash table lookup, basic file fingerprinting, or other non-cryptographic uses. Both the upper and lower bits are uniformly distributed. The hash function achieves avalanche for every bit combination tested." – Dummy00001 Sep 11 '10 at 11:33
  • At risk of repeating myself, I'll point out that this comment was about the lookup3 algorithm, whereas the avalanche graphic we were discussing was for the one-at-a-time algorithm. In short, Mulvey is not backing you up on this one. – Steven Sudit Sep 11 '10 at 12:06
  • side note: Delphi 2009 and up use this funtion to hash dictionary entries - [Generics.Defaults.BobJenkinsHash](http://docwiki.embarcadero.com/VCL/de/Generics.Defaults.BobJenkinsHash) – mjn Nov 25 '11 at 07:14
2

I've tried many fast hash functions and chosen this one:

function StrHash(const st:string):cardinal; 
 var
  i:integer;
 begin
  result:=0;
  for i:=1 to length(st) do
   result:=result*$20844 xor byte(st[i]);
 end;

It is as fast as K&R function (actually even faster) but makes better (more even) distribution.

-3

A very simple method is to just XOR all values. The simplest as far as I know.

Bo Persson
  • 90,663
  • 31
  • 146
  • 203
AGee
  • 1