
I'm trying to implement my own hash function in Java: I add up the ASCII values of the characters in each string, then take that sum modulo the size of the hash table (sum % size) to get the hash code. I was wondering if there is a way to keep the same basic process but reduce collisions when searching for a string?
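
Here's a rough sketch of the kind of thing I mean (not my exact code; the names are just for illustration):

// Sketch of the approach described above: sum the character values,
// then take the sum modulo the table size to get a bucket index.
static int naiveHash(String key, int tableSize) {
    int sum = 0;
    for (int i = 0; i < key.length(); i++) {
        sum += key.charAt(i);     // add up the character values
    }
    return sum % tableSize;       // bucket index in [0, tableSize)
}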

Thanks in advance.

Marcello
  • So you're trying to make a String hashCode function faster than the standard one? Did I read that correctly? If so, what are your requirements regarding this function? – Denys Séguret Dec 11 '12 at 17:40
  • Your question is too vague to be answerable. What is "the unicode of each string"? What index do you multiply by 2? Where did you read that? – assylias Dec 11 '12 at 17:40
  • There is not enough context here, but I've never seen "multiplying the index by 2 [..] make[s] the hash function faster". When making such claims, *please provide a reference*. – Dec 11 '12 at 17:40
  • I think it would benefit your question if you included a direct quotation of the claim that you're referring to. Some context wouldn't hurt either. – NPE Dec 11 '12 at 17:40
  • Are you looking to improve the speed of computing the hash, or to improve the speed of hash lookups based on your new hash function? These two usually go in opposite directions. – Sergey Kalinichenko Dec 11 '12 at 17:43
  • Sorry, I will edit my question to make it clearer – Marcello Dec 11 '12 at 17:44
  • @Marcello Instead of explaining what you do, why don't you simply post your code? – assylias Dec 11 '12 at 17:48
  • This is a terrible hash function: when you swap two chars the hash is the same. Why don't you simply use the String hashCode function? – Denys Séguret Dec 11 '12 at 17:51
  • The hash depends on a property of the hash table - what if the hash table size increases? – ignis Dec 11 '12 at 17:53

2 Answers


Java's String.hashCode() makes a tradeoff between being a really good hash function and being as efficient as possible. Simply adding up the character values of a string does not make a reliable hash function.

For example, consider the two strings "dog" and "god". Since they both contain a 'd', a 'g', and an 'o', no hash function based only on adding the character values can ever give them different hash codes.
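
To make that concrete (the values below are just the standard ASCII codes):

int dog = 'd' + 'o' + 'g';   // 100 + 111 + 103 = 314
int god = 'g' + 'o' + 'd';   // 103 + 111 + 100 = 314
// Whatever table size you mod by, "dog" and "god" land in the same bucket.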

Joshua Bloch, who implemented a good part of the Java class libraries, discusses the String.hashCode() method in his book Effective Java. He notes that in versions of Java prior to 1.3, String.hashCode() considered at most 16 characters of a given String. That ran somewhat faster than the current implementation, but resulted in shockingly poor performance in certain situations.

In general, if your data set is well defined and has some uniqueness you can exploit, you can probably make a better hash function. For general-purpose Strings, good luck.

utopianheaven

I would look at the code for String and HashMap, as these have a low collision rate, avoid %, and handle negative hash codes.

From the source for String

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}
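
If you want to experiment with the same idea outside the String class, a rough standalone equivalent (ignoring the caching into the hash field) would look something like this:

// Order matters: 31 * h + c weights each character position differently,
// so "dog" and "god" end up with different hash codes.
static int polynomialHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}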

From the source for HashMap

/**
 * Retrieves object hash code and applies a supplemental hash function to the
 * result hash, which defends against poor quality hash functions.  This is
 * critical because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
final int hash(Object k) {
    int h = 0;
    if (useAltHashing) {
        if (k instanceof String) {
            return sun.misc.Hashing.stringHash32((String) k);
        }
        h = hashSeed;
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

As the HashMap's table is always a power of two in size, you can use

        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);

and

/**
 * Returns index for hash code h.
 */
static int indexFor(int h, int length) {
    return h & (length-1);
}

Using & is much faster than %, and it only produces non-negative indices because length is positive.
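
For example, with a power-of-two length and a hash code that happens to be negative (the values are just for illustration):

int h = -123456789;
int length = 16;                  // power of two
int byMask = h & (length - 1);    // 11, always a valid index in [0, 15]
int byMod  = h % length;          // -5 in Java, not a valid bucket index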

Peter Lawrey
  • If I were to multiply the sum by 128, it would give fewer collisions, but why do numbers such as 128 and 37 make a difference? – Marcello Dec 11 '12 at 19:08
  • Multiplying mixes the numbers up and gives you a different pattern. Prime numbers are usually a good idea, as they are less likely to appear naturally and create a pattern where there wasn't one. Multiplying by 128 is not a good idea in general, as your lowest 7 bits will always be zero, effectively reducing the "randomness" of your hash value. – Peter Lawrey Dec 12 '12 at 08:24
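
A small illustration of that point about the low bits (the sum is just an example value):

int sum = 314;                  // e.g. the additive sum for "dog"
int byPowerOfTwo = sum * 128;   // same as sum << 7, so the lowest 7 bits are always 0
int byPrime = sum * 31;         // low bits still vary from key to key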