
I need to compare two very large strings. Rather than calling `equals()`, is there something like `hashCode()` that generates a unique ID for a String? I need an ID that is distinct whenever the content is distinct. Can `String.hashCode()` be used for this purpose?

Sanka
  • 1
    Does this answer your question? [Why do I need to override the equals and hashCode methods in Java?](https://stackoverflow.com/questions/2265503/why-do-i-need-to-override-the-equals-and-hashcode-methods-in-java). Short answer: String already has fast `hashCode()` implemented, which is used to speed up comparisons. :) However `hashCode()` is not unique, so you cannot avoid sometimes calling `equals()`. – Šimon Kocúrek Mar 25 '20 at 20:04
  • 1
    You'd be better off just using the [UUID](https://docs.oracle.com/javase/7/docs/api/java/util/UUID.html) class to generate a unique identifier. – Tim Hunter Mar 25 '20 at 20:04
  • 1
    UUID won't work for comparing strings. No, a hash code won't work either. Although hash codes speed up comparisons in many cases, you still have to compare the whole string to determine that two strings are equal (unequal strings, though, will usually compare as false very quickly). – markspace Mar 25 '20 at 20:05
  • 1
    Also, the more I think about this the more it seems to be premature optimization. The first thing a string compare normally checks is the length, which is a very fast operation. Complicated systems to "make it faster" are likely to be unneeded, or even make code slower. Profile the code first, then only add complication if it is needed when a full system is running. – markspace Mar 25 '20 at 20:15

1 Answer


The purpose of hashCode is to provide a quick means of identifying most of the circumstances where two objects would compare unequal. A hash function which has a 1% false positive rate would for most purposes be considered superior to one that has a 0% false positive rate, but takes twice as long.
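To illustrate why `hashCode()` can only rule strings out and never confirm equality, here is a minimal sketch. The class name `HashCodeFilter` and the helper `sameContent` are hypothetical names chosen for this example, and both arguments are assumed to be non-null. The strings `"Aa"` and `"BB"` are a well-known pair with the same `String.hashCode()` value.

```java
public class HashCodeFilter {

    // Hypothetical helper: uses hashCode() only as a cheap early reject.
    // Different hash codes prove the strings differ; equal hash codes
    // prove nothing, so equals() still has to make the final decision.
    static boolean sameContent(String a, String b) {
        if (a.hashCode() != b.hashCode()) {
            return false;
        }
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Two different strings with the same 32-bit hash code.
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(sameContent("Aa", "BB"));            // false
        System.out.println(sameContent("abc", "abc"));          // true
    }
}
```

Since `String` caches its hash code after the first computation, the early reject is cheap on repeated comparisons, but the first call still has to read the whole string once.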

Some hash functions are designed for use as "digests", such that two different strings of arbitrary length are very unlikely to have the same digest. To be effective, however, a digest needs to be much larger than a 32-bit hashCode value. A well-designed 64-byte (512-bit) digest would generally guard strings of any length well enough that one would be more likely to be struck by lightning twice on the same weekend that one wins five state lotteries than to find two different strings that yield the same digest. Computing a good digest for a string costs much more than comparing that string to another string, but if each string will be compared against many others, computing each digest once and comparing it with the digests of every other string may offer a major performance win.
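As a rough sketch of the digest approach, the following uses the JDK's `java.security.MessageDigest` with SHA-512, which happens to produce the 64-byte digest size mentioned above. The choice of SHA-512, the UTF-8 encoding, and the names `DigestCompare`, `sha512`, and `sameDigest` are assumptions made for this example, not something the answer prescribes. SHA-512 ships with standard JDKs, but the code fails loudly if a runtime does not provide it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestCompare {

    // Computes a 64-byte SHA-512 digest of the string's UTF-8 bytes.
    // Intended to be called once per string, with the result stored.
    static byte[] sha512(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-512");
            return md.digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // Assumption: the runtime provides SHA-512.
            throw new IllegalStateException("SHA-512 not available", e);
        }
    }

    // Compares two precomputed digests; this touches 64 bytes no matter
    // how long the original strings were.
    static boolean sameDigest(byte[] d1, byte[] d2) {
        return Arrays.equals(d1, d2);
    }

    public static void main(String[] args) {
        String big1 = "x".repeat(1_000_000);       // String.repeat needs Java 11+
        String big2 = "x".repeat(999_999) + "y";

        // Pay the digest cost once per string...
        byte[] d1 = sha512(big1);
        byte[] d2 = sha512(big2);

        // ...then every later comparison is a short array compare.
        System.out.println(d1.length);
        System.out.println(sameDigest(d1, d2));
        System.out.println(sameDigest(d1, sha512(big1)));
    }
}
```

This only pays off when each string takes part in many comparisons. For a single pair, `equals()` is cheaper, because computing the digest already requires reading the entire string.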

supercat