0

I was reading the example in Oracle's online java tutorial that uses HashMap to store anagrams:

    // Read words from file and put into a simulated multimap
    Map<String, List<String>> m = new HashMap<String, List<String>>();

    try {
        Scanner s = new Scanner(new File(args[0]));
        while (s.hasNext()) {
            String word = s.next();
            String alpha = alphabetize(word);
            List<String> l = m.get(alpha);
            if (l == null)
                m.put(alpha, l=new ArrayList<String>());
            l.add(word);
        }
    } catch (IOException e) {
        System.err.println(e);
        System.exit(1);
    }

    // Print all permutation groups above size threshold
    for (List<String> l : m.values())
        if (l.size() >= minGroupSize)
            System.out.println(l.size() + ": " + l);
}

private static String alphabetize(String s) {
    char[] a = s.toCharArray();
    Arrays.sort(a);
    return new String(a);
}

}

Since HashMap is implemented with Hash Table, I think that each sorted alphabetized string should have a unique hash code after compression(otherwise the linked list that stores the values in the HashMap will store a value that's not a anagram of the sorted alphabetized string).

I am not exactly sure how Java's implementation of HashMap can satisfy this - I assume that they use the hash code of a string (a1*31^n-1 + a2*31^n-2 + ... + an). This might guarantee the uniqueness of the hash code if we are talking about strings with only lower case chars. However, one also has to compress the hash code before putting the value of the key to the bucket in the hash table (otherwise you would have a hugggggge hash table that can't be handled in memory, just thinking of how big 31^10 is). Among this compression, I would think that there will be collision. In other words, two different strings that are not true anagram will end up being stored in the same bucket (which should be only used to store a list of true anagrams)...

Could anyone help me to understand what I might be missing? Or if the online tutorial is missing something?

thanks!

Jason

Eric H.
  • 341
  • 2
  • 7
  • 18
  • 1
    http://stackoverflow.com/questions/4980757/how-do-hashtables-deal-with-collisions and in particular: http://stackoverflow.com/a/4980796/829571 – assylias Nov 26 '12 at 17:58
  • Thanks for pointing me to that post. I can understand that hash code is used to locate bucket and the equals() method will be used later to retrieve the correct entry. However, I can't image how to store multiple nodes in a linked list without a next node pointer. If there are 2+ nodes in a linked list without a next node pointer, then the list is no longer "linking" all the nodes and thus that shouldn't be a linked list, right? – Eric H. Nov 26 '12 at 21:36
  • The other question after reading that explanation of how HashMap works is that, for two strings (non anagrams) with same hash code, the example provided in the java tutorial is going to store them in the same bucket. So you are saving something like "abc", "cab" and "def" into the same linked list. Although later you may be able to just print out "abc" and "cab", the "def" is stored at the wrong place and another bucket that expected this "def" is missing the anagram entry. Not sure how this will be addresses by the hashMap implementation. Or maybe that's just the bug from the example? – Eric H. Nov 26 '12 at 21:40

1 Answers1

1

Never, ever assume that hash codes are unique -- but realize that HashMap already knows that hash codes are non-unique, and deals with them appropriately.

In other words, even if a.hashCode() == b.hashCode(), if !a.equals(b), then a HashMap will not mix up the value associated with a and the value associated with b.

Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
  • I understand the first part of your comment. However, I am not sure how HashMap will make sure "if !a.equals(b), then a HashMap will not mix up the value associated with a and the value associated with b.". This is exactly why I am asking the question. Would you mind explain it a bit more? thanks. :P – Eric H. Nov 26 '12 at 21:41
  • 1
    `HashMap` doesn't _just_ look at the hash code of a key, it also checks if the key is `equals` to preexisting keys. So there's no way it could possibly mix up `a` and `b`, since it checks not just the `hashCode` but also `equals`. – Louis Wasserman Nov 26 '12 at 22:03
  • I thought HashMap will only check the keys with equals method when it tries to retrieve key/value from the linked list hold at the bucket. Are you saying that when we put key/value to the HashMap, the implementation also checks the key using the equals method? I thought it would only use HashCode(and suppression function) to decide which bucket to put the entry. IN other words, if two strings have the same hash code (doesn't mean they have to be equal), they will end up at the same bucket. Maybe I am wrong? – Eric H. Nov 27 '12 at 04:12
  • Refer to the source code of HashMap in JDK 1.6 (see http://stackoverflow.com/questions/4980757/how-do-hashtables-deal-with-collisions), I can see that when we tried to get the entry, it will invoke equals method. However, when we add new entry, the implementation will straightly add the new entry at the front of the linked list (and doesn't invoke any equals check). – Eric H. Nov 27 '12 at 04:18
  • That is just plain [false](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#386). – Louis Wasserman Nov 27 '12 at 14:43
  • Sorry I don't think I understand your response. Basically I agree that when we retrieve information from HashMap, we will not mess up a and b even if a.hashCode()==b.hashCode() (or hash(a.hashCode())==hash(b.hashCode()). However, at the time of insertion, I think the anagram example might put some false positive strings into the linked list. – Eric H. Nov 27 '12 at 18:28
  • Insertion _does check equals_. That's clearly demonstrated in the code I just linked to. If it didn't, that would be a bug worthy of reporting to Oracle. – Louis Wasserman Nov 27 '12 at 18:30
  • Thanks for pointing me to the code - I wasn't aware of such nice code repository. Anyway, in the case of this anagram example, when we insert the new entry (put(key,value)), the implementation checks for equals of key, but then still adds the item to the linked list of the same bucket anyways. So the anagram example replies on the assumption that two different strings (that are not anagrams) won't have the same code after invoking hash(key.hashcode()). This may or may not be true - I am not sure how to prove it one way or the other. Anyway - thanks again for pointing me to the link. – Eric H. Nov 27 '12 at 19:41
  • The point is that having two entries in the same bucket linked list is entirely normal and fine, because even if two entries are in the same bucket, they'll never get mixed up. In other words, `HashMap` is a perfectly fine, working `Map` implementation even if *every* element has a hash code of 0. Or if every element has a hash code of 42. Or any constant. It's just _slower_, not _wrong_. – Louis Wasserman Nov 27 '12 at 20:16
  • I agree - it's perfectly fine to have collision and most of the time we can't avoid it. However, my point is that then in the anagram example this will end up storing "abc" and "def" in the same bucket (if "abc" and "def" hashes to the same bucket). In that case, printing out the content of each linked list won't show a correct collection of the anagrams. – Eric H. Nov 27 '12 at 21:56
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/20200/discussion-between-louis-wasserman-and-jason) – Louis Wasserman Nov 27 '12 at 21:56