6

What is the best method to create a hash of String, if the hash may not have more than 4 characters, and those 4 characters may only be lowercase letters or digits?

The strings I want to hash have 1-255 characters. I know that it's probably impossible to create as 4-char hash without collision. But it would be sufficient if I'd have a good hash where possible collisions are minimized.

What I tried is the CRC16CCITT from here: http://introcs.cs.princeton.edu/java/61data/CRC16CCITT.java

public class CRC16CCITT { 

    public static void main(String[] args) { 
        int crc = 0xFFFF;          // initial value
        int polynomial = 0x1021;   // 0001 0000 0010 0001  (0, 5, 12) 

        // byte[] testBytes = "123456789".getBytes("ASCII");

        byte[] bytes = args[0].getBytes();

        for (byte b : bytes) {
            for (int i = 0; i < 8; i++) {
                boolean bit = ((b   >> (7-i) & 1) == 1);
                boolean c15 = ((crc >> 15    & 1) == 1);
                crc <<= 1;
                if (c15 ^ bit) crc ^= polynomial;
            }
        }

        crc &= 0xffff;
        StdOut.println("CRC16-CCITT = " + Integer.toHexString(crc));
    }

}

But this gives too many collision. Are there better algorithms?

membersound
  • 81,582
  • 193
  • 585
  • 1,120
  • 8
    Lowercase letters and numbers means that there are only 36^4 different hashes, so, even with a hashing function that generates uniformly-distributed hashes, you're more likely than not to have collisions once you have ~sqrt(36^4) = 1296 values (by the Birthday paradox). You simply need more possible values in the hash space. – Andy Turner Nov 17 '16 at 09:51
  • might be useful to take a look at this: http://stackoverflow.com/questions/12076846/using-a-larger-prime-as-a-multiplier-when-overriding-hashcode – posdef Nov 17 '16 at 09:55
  • @AndyTurner thanks for clarification. Anyways I'm limited to 4 chars, so I know I'll have nonunique hashes by design. But I'm looking for an algorithm that gives me the best "less likely to collide" hash. – membersound Nov 17 '16 at 10:15
  • 3
    To evaluate your current hash function without doing complicated maths, use a random number generator with output between 0 and 36^4-1 as the hash function on the same set of distinct strings and compare the number of collisions. An optimal hash function should behave like a random number generator on a set of distinct inputs; you may find that your hash function is already close enough to optimal. – tom Nov 17 '16 at 10:22
  • 1
    The title leads almost in the opposite direction: the restriction is on the hash values, not on the string. Define `too many`; give figures from your observation (how many distinct strings, how many collisions). (Use a fast hash giving "big" integers (`String.hashCode()`?), use four characters from [`.toString(hash, 36)`](https://docs.oracle.com/javase/8/docs/api/java/lang/Long.html#toString-long-int-).) – greybeard Nov 17 '16 at 10:36
  • 1
    Your hashes use the digits 0 - 9 and letters 'a' - 'f' rather than the full 36 characters. You need more bits in your CRC to make use of the entire hash space (36^4 = 1679616 available hashes, so at least 21 bits). – tom Nov 17 '16 at 10:42
  • @greybeard sorry I corrected it. – membersound Nov 17 '16 at 10:53
  • @tom could you give a hint how to change the algorithm to use the full character set? – membersound Nov 17 '16 at 10:53
  • Use a 32-bit or 24-bit CRC algorithm, then use `crc %= 36*36*36*36;` to reduce to a valid hash and `Integer.toString(crc, 36)` to encode it. – tom Nov 17 '16 at 10:57
  • (Or use toString on the full CRC then take the last 4 characters instead of using %. Watch out for strings with less than 4 characters.) – tom Nov 17 '16 at 11:06
  • @tom that's both nice. Why would one prefer one approach over the other? – membersound Nov 17 '16 at 11:08
  • 1
    Personal preference, clarity, length of code, how you want to handle padding... – tom Nov 17 '16 at 11:20

1 Answers1

0

You are mistaking "hexadecimal digits" for "characters":

    int crc = 0xFFFF;          // initial value

That's only 2 bytes (0xFF is just 1 byte). For a CRC of 4 ANSI characters, you need 4 bytes (0xFFFFFFFF).
You'll have to adapt the rest of the code to work with double the legth, please comment if you don't know how to do that.

PS: You could do it with less than 4 bytes, but that would complicate things more than necessary.

walen
  • 7,103
  • 2
  • 37
  • 58
  • I'm not experienced in crc algorithms or byte encoding. I just took the example class linked in my question. Would be great if you could give an adapted algorithm according to 4 ansi chars. – membersound Nov 17 '16 at 11:03
  • Have a look at the 32 bit (4 bytes) version, the "direct calculation" part is at the end: http://introcs.cs.princeton.edu/java/61data/CRC32.java.html – walen Nov 17 '16 at 11:09