4

I have about 50,000 words that I want to map each of them to a 16bit number and I'm seeking for a hash function to run on j2me. To be more specific I'm looking for a hash function with below criteria:

  1. few (or no) collisions
  2. light CPU load
  3. I have all of the words now
  4. Avalanche effect is not important, since it's not about security. It' just a look-up table.

I've tested java Strign.hashCode(), murmur hash, jenkins one at a time and a few simple hand-made ones but all of them have at least 30% collisions.
The minimal perfect hashing seems to have heavy CPU load for a small mobile phone too.

Can anybody help me with this?

note: As you know murmur algorithm needs a seed and different seeds have different uniformity. How can I find the seed with minimum collisions?

Thank you in advance

mohsenof
  • 109
  • 2
  • 11
  • Would other data structures work for you? For example a trie? – Omri Barel Aug 20 '11 at 11:34
  • This could be of interest for you: [Fastest possible string key lookup for known set of keys](http://stackoverflow.com/q/6714715/805681) and [Directed acyclic word graph](http://en.wikipedia.org/wiki/Acyclic_deterministic_finite_automaton) – Jiri Kriz Aug 20 '11 at 13:38
  • @Omri Barel: Thank you for the comment. I want to minimize memory accesses. I guess if I can find a good hash function it would be much faster and accesses memory much less. – mohsenof Aug 21 '11 at 08:24

3 Answers3

0

Here is the function I use in C# to map file name to 16 bit number. In my tests it performed better than Pearson hashing.

    public static unsafe int Get16BitHash(string str)
    {
        int hash = 0;
        int len = str.Length;

        fixed (char* ch = str)
        {
            for (int i = 0; i < len; i++)
            {
                hash = hash + ((hash) << 5) + *(ch + i) + ((*(ch + i)) << 7);
            }
        }

        return ((hash) ^ (hash >> 16)) & 0xffff;
    }
ialiashkevich
  • 615
  • 7
  • 8
  • If you are returning the data type int, this is still returning a 32 bit number, not 16? – Jake Drew Oct 09 '15 at 18:52
  • It is preferable to perform calculations with 32 bit integers for performance reasons. The returned 32 bit integer has only low 16 bit, the rest high 16 bit are all zeros. – ialiashkevich Oct 21 '15 at 01:21
0

You could look into an old-fashioned CRC. They are very fast and reasonably collision-free. Just not quite at 16 bits, as this experiment seems to indicate. But nevertheless, you might give it a try, maybe it's good enough for your purposes.

emboss
  • 38,880
  • 7
  • 101
  • 108
0

This answer might be late, but for reference, MurmurHash 3 is sufficiently fast to satisfy your speed criteria. However due to the constraint you have imposed, collisions will be quite common, since 16 bits can represent a range of 65536 values, your 50000 words would create some collisions.

Solutions:

  • use 20+ bits for the key (with 32 bits, there is one collision in a few million samples)
  • write a test program to find a fit seed for 16 bits, here are some useful tools: http://code.google.com/p/smhasher/
Andrei
  • 529
  • 5
  • 12