7

I understand that this seems to be already discussed and the answer is yes, String.hashCode can generate equal vales for different strings, but quite unlikely (Can Java's hashCode produce same value for different strings?). However it does happen in my application.

The following code will produce the same hashcode: -347019262 (jave 1.7.25)

String string1 = "/m/06qw_";
String string2="/m/0859_";
System.out.println(string1+","+string1.hashCode());
System.out.println(string2+","+string2.hashCode());

I do need hashcode in this case, and I want to use it to generate a unique primary key for a string. it seems that I am not doing it right. Any suggestions please?

Many thanks!

Community
  • 1
  • 1
Ziqi
  • 2,445
  • 5
  • 38
  • 65
  • The hash code of 2 string can be same but not equals – Kick Mar 11 '14 at 10:52
  • You can try something with `System.identityHashCode()`. I think that it is supposed to be unique among the JVM (But I am not sure about that so I don't post it as an answer). – Arnaud Denoyelle Mar 11 '14 at 10:55
  • @ArnaudDenoyelle basically, this will return the result of `Object`'s `.hashCode()` on any instance – fge Mar 11 '14 at 10:55
  • Javadoc about `Object.hashCode()` says `class Object does return distinct integers for distinct objects` so calling `System.identityHashCode()` would work in a single JVM. – Arnaud Denoyelle Mar 11 '14 at 11:03
  • 2
    you cannot ensure unique hashCodes, nor do you need to, nor would it prevent collisions in a hash map for example. Unique hashCodes can still give you collisions. – Peter Lawrey Mar 11 '14 at 11:11
  • See also http://stackoverflow.com/questions/1329639/create-a-unique-primary-key-hash-from-database-columns – Raedwald Mar 14 '14 at 08:23

5 Answers5

13

You misunderstand .hashCode().

One part of the contract is that objects who are equals() must have the same hashCode(). However, the reverse is not true: two objects who have the same hashCode() do not have to be equals().

This is a valid, albeit perfectly useless, hashCode() implementation:

@Override
public int hashCode()
{
    return 42; // universal answer
}

You should use the string itself as the "primary key". If you want a "more efficient" key, you should consider what format the input string is and, if possible, extract a significant part of this input.

fge
  • 119,121
  • 33
  • 254
  • 329
4

The sensible option is to use the string as the primary key. (Another choice would be to associate a GUID with your data record and have that as the primary key.)

Hashing is meant to be (1) fast and (2) such that two equal strings will have the same hash code.

I'd submit it's likely that you'll get hashing clashes; after all an int (the hash return type) only has about 4 billion distinct values.

Bathsheba
  • 231,907
  • 34
  • 361
  • 483
4

I do need hashcode in this case, and I want to use it to generate a unique primary key for a string. it seems that I am not doing it right. Any suggestions please?

You should always be cautious about using hash values primary keys. They are not unique. And the smaller the range of the hash function, the worse the problem is.

In your case, hashcode (and the identityHashcode() method suggested in a comment) generate a 32 bit value. For any pair of two different randomly generated strings, there is a chance of 1 in 2^32 that the hashcodes will be the same. This is true for any method of generating (32 bit) hash codes.

Now a chance of (roughly) 1 in 2 billion does not sound like much. But you don't need just pair-wise uniqueness. You actually need all of your strings' hashcodes to be unique ... because you are attempting to use the hashcodes as primary keys, and primary keys must be unique. And the table on the Wikipedia page "birthday problem" says that you only need roughly 50,000 keys before the probability of a collision rises to 1 in 4. (Yes ... ONE in FOUR!)

In short, don't use hashcode() values as primary keys.

The same table, indicates a good hash function that generated 128 bit hash values would probably be good enough to avoid collisions. But checkout the probabilities for yourself and make your own judgement.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
2

You can use SHA1 hash algorithm to reduce collision probability. Take a look of this snippets to see, how to calculate SHA1 hash in Java: http://www.sha1-online.com/sha1-java/

Boris Brodski
  • 8,425
  • 4
  • 40
  • 55
2

You could use

System.identityHashcode(Object);

to get the unique results.

EDIT

I thought that may be the murmur hash guava's implementation could help here also:

 HashFunction hash = Hashing.murmur3_128();
 hash.hashString("/m/06qw_", Charset.defaultCharset()).asInt();

Generally murmur hash is supposed to be fast and reliable.

Eugene
  • 117,005
  • 15
  • 201
  • 306