How to use String.hashCode to generate primary keys

Question

I understand that this seems to be already discussed and the answer is yes, String.hashCode can generate equal vales for different strings, but quite unlikely (Can Java's hashCode produce same value for different strings?). However it does happen in my application.

The following code will produce the same hashcode: -347019262 (jave 1.7.25)

String string1 = "/m/06qw_";
String string2="/m/0859_";
System.out.println(string1+","+string1.hashCode());
System.out.println(string2+","+string2.hashCode());

I do need hashcode in this case, and I want to use it to generate a unique primary key for a string. it seems that I am not doing it right. Any suggestions please?

Many thanks!

You can try something with `System.identityHashCode()`. I think that it is supposed to be unique among the JVM (But I am not sure about that so I don't post it as an answer). — Arnaud Denoyelle, Mar 11 '14 at 10:55
@ArnaudDenoyelle basically, this will return the result of `Object`'s `.hashCode()` on any instance — fge, Mar 11 '14 at 10:55
Javadoc about `Object.hashCode()` says `class Object does return distinct integers for distinct objects` so calling `System.identityHashCode()` would work in a single JVM. — Arnaud Denoyelle, Mar 11 '14 at 11:03
you cannot ensure unique hashCodes, nor do you need to, nor would it prevent collisions in a hash map for example. Unique hashCodes can still give you collisions. — Peter Lawrey, Mar 11 '14 at 11:11
See also http://stackoverflow.com/questions/1329639/create-a-unique-primary-key-hash-from-database-columns — Raedwald, Mar 14 '14 at 08:23

score 13 · Answer 1 · answered Mar 11 '14 at 10:50

You misunderstand .hashCode().

One part of the contract is that objects who are equals() must have the same hashCode(). However, the reverse is not true: two objects who have the same hashCode() do not have to be equals().

This is a valid, albeit perfectly useless, hashCode() implementation:

@Override
public int hashCode()
{
    return 42; // universal answer
}

You should use the string itself as the "primary key". If you want a "more efficient" key, you should consider what format the input string is and, if possible, extract a significant part of this input.

score 4 · Answer 2 · answered Mar 11 '14 at 10:50

The sensible option is to use the string as the primary key. (Another choice would be to associate a GUID with your data record and have that as the primary key.)

Hashing is meant to be (1) fast and (2) such that two equal strings will have the same hash code.

I'd submit it's likely that you'll get hashing clashes; after all an int (the hash return type) only has about 4 billion distinct values.

Stephen C · Answer 3 · 2014-03-11T11:55:04.353

I do need hashcode in this case, and I want to use it to generate a unique primary key for a string. it seems that I am not doing it right. Any suggestions please?

You should always be cautious about using hash values primary keys. They are not unique. And the smaller the range of the hash function, the worse the problem is.

In your case, hashcode (and the identityHashcode() method suggested in a comment) generate a 32 bit value. For any pair of two different randomly generated strings, there is a chance of 1 in 2^32 that the hashcodes will be the same. This is true for any method of generating (32 bit) hash codes.

Now a chance of (roughly) 1 in 2 billion does not sound like much. But you don't need just pair-wise uniqueness. You actually need all of your strings' hashcodes to be unique ... because you are attempting to use the hashcodes as primary keys, and primary keys must be unique. And the table on the Wikipedia page "birthday problem" says that you only need roughly 50,000 keys before the probability of a collision rises to 1 in 4. (Yes ... ONE in FOUR!)

In short, don't use hashcode() values as primary keys.

The same table, indicates a good hash function that generated 128 bit hash values would probably be good enough to avoid collisions. But checkout the probabilities for yourself and make your own judgement.

score 2 · Answer 4 · answered Mar 11 '14 at 10:51

2

You can use SHA1 hash algorithm to reduce collision probability. Take a look of this snippets to see, how to calculate SHA1 hash in Java: http://www.sha1-online.com/sha1-java/

answered Mar 11 '14 at 10:51

Boris Brodski

8,425
4
40
55

Thank you all, I think I can make good use of all your comments. – Ziqi Mar 11 '14 at 11:12

Eugene · Answer 5 · 2014-03-11T11:03:35.233

2

You could use

System.identityHashcode(Object);

to get the unique results.

EDIT

I thought that may be the murmur hash guava's implementation could help here also:

 HashFunction hash = Hashing.murmur3_128();
 hash.hashString("/m/06qw_", Charset.defaultCharset()).asInt();

Generally murmur hash is supposed to be fast and reliable.

edited Mar 11 '14 at 11:03

answered Mar 11 '14 at 10:53

Eugene

117,005
15
201
306

+1, but you *could* still get clashes: e.g. in an `rmi` environment. – Bathsheba Mar 11 '14 at 10:54
1

Except that this will produce different results for `"test"` and `new String("test")`... – fge Mar 11 '14 at 10:54
@fge agree. May be murmur hash could help also – Eugene Mar 11 '14 at 11:04
Good one about using murmur, however `Charset.defaultCharset()` not so good: it depends on the JVM – fge Mar 11 '14 at 11:19

How to use String.hashCode to generate primary keys

5 Answers5