
I want to get a unique hex value for each unique string. At the moment I am using the MessageDigest class, but there is a possibility (very small, but still) that two different strings produce the same hex value. Is there any other way to do this, so that every unique string gets a unique hex value?

Thanks in advance.

Abhishek Maheshwari
  • Did you mean `hash`? If so: No way. There aren't enough different `int` values to match each string to a different one. Also, `hash` is not meant to be unique. Why do you need truly unique hashes? – tobias_k Jun 10 '15 at 11:42
  • what do you mean by "unique hex"? Hexadecimal is a number system .... – specializt Jun 10 '15 at 11:43
  • Transform your string to bytes using UTF8, then encode the bytes to Hex. You'll have a unique value. – JB Nizet Jun 10 '15 at 11:43
  • I think you meant "format to hex", because there is no encoding being done at all. But yes: this will of course produce unique hexadecimal number sequences for unique strings – specializt Jun 10 '15 at 11:44

2 Answers


(Assuming that you meant hash, not hex, since you mentioned MessageDigest).

You can't have a unique hash code for each unique string. Think of it this way: A hash function maps a string (or any other object) to an integer number. Since each integer number can be represented as a string, e.g. "123", there are at least as many strings as there are different integer numbers -- and then some more, like, everything that is not a number, e.g. "Hello". Thus, as there are more strings than integer numbers, it is not possible to generate unique hash codes for unique strings in all cases.
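A small sketch of what such a collision looks like in practice with `String.hashCode`. The class and method names are made up for illustration; the pair "Aa" / "BB" is a well-known example of two strings with the same 32-bit hash code.

```java
public class HashCollisionDemo {
    // Hypothetical helper: true when two strings share the same 32-bit hash code.
    public static boolean sameHash(String a, String b) {
        return a.hashCode() == b.hashCode();
    }

    public static void main(String[] args) {
        String a = "Aa"; // 65 * 31 + 97 = 2112
        String b = "BB"; // 66 * 31 + 66 = 2112
        System.out.println(sameHash(a, b)); // true
        System.out.println(a.equals(b));    // false
    }
}
```

The strings differ, yet their hash codes are equal, which is exactly the situation the counting argument above says cannot be avoided.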

Having said that, for "everyday hashing" (for hash tables etc.), the hash function provided by String.hashCode is about as good as it gets. For cryptographic hashing, MessageDigest seems to be the way to go. Depending on what you currently use, you might be able to upgrade to a stronger algorithm, though, e.g. SHA-512 instead of SHA-256.
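For reference, a minimal sketch of computing a hex digest with `MessageDigest`. The class name `DigestHex` and the helper `digestHex` are hypothetical; the algorithm names "SHA-256" and "SHA-512" are standard names that `MessageDigest.getInstance` accepts. It assumes the string should be hashed as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DigestHex {
    // Hypothetical helper: digest of the input's UTF-8 bytes as lowercase hex.
    public static String digestHex(String input, String algorithm)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            // %02x zero-pads, so every byte becomes exactly two hex digits.
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(digestHex("hello world", "SHA-256").length()); // 64
        System.out.println(digestHex("hello world", "SHA-512").length()); // 128
    }
}
```

A longer digest makes an accidental collision less likely, but the output is still fixed-length, so it cannot be unique for every possible input.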

tobias_k
  • See also [Pigeonhole principle](http://en.wikipedia.org/wiki/Pigeonhole_principle). – Jesper Jun 10 '15 at 11:57
  • String.hashCode() is not in the same league as MessageDigest at all. Finding conflicts for hashCode() is trivial (and has, BTW, been the basis of DoS attacks), whereas MessageDigest is a cryptographic hash, for which finding conflicts is almost impossible. hashCode() produces an int (32 bits). A MessageDigest typically produces 16-20 bytes. – JB Nizet Jun 10 '15 at 11:59
  • @JBNizet Still, I'd assume that if you want an `int`, then `String.hashCode` is okay? Or has it major flaws that were never fixed, maybe for compatibility reasons? – tobias_k Jun 10 '15 at 12:03
  • It's OK if you want integers that are as unique as possible, but conflicts are easy to obtain. If uniqueness is the goal, 32 bits are not enough. – JB Nizet Jun 10 '15 at 12:18
  • @JBNizet every checksum of any kind will always produce collisions within a finite amount of time, no matter the circumstances - "almost impossible" is highly exaggerated, one can even produce collisions for UUIDs pretty easily and the same is true for `MessageDigest` - the basic assumption around checksums is: `the end-user is not explicitly searching for collisions`, hence most modern checksum algorithms are pretty sufficient; during the lifespan of most apps and webapps collisions are `unlikely` to occur - but it's certainly **not** `almost impossible` – specializt Jun 10 '15 at 12:19
  • Read http://stackoverflow.com/questions/201705/how-many-random-elements-before-md5-produces-collisions. To find a collision with MD5, for example, which is weak and now considered broken, you would have, on average, to hash 6 billion files per second for 100 years. That's what I would call "almost impossible". – JB Nizet Jun 10 '15 at 12:28
  • Apparently you don't seem to grasp the difference between "actively searching for collisions / bruteforcing / exploiting algorithms" and "likelihood of passively encountered collision" – specializt Jun 10 '15 at 15:11
  • OK, so I'll rephrase: MessageDigest is a cryptographic hash, for which **encountering** conflicts is almost impossible. Finding collisions by actively searching for them is extremely hard: http://crypto.stackexchange.com/questions/3049/are-there-any-known-collisions-for-the-sha-2-family-of-hash-functions. Prove me wrong: I'll give you 7 days to find a collision, with SHA256, for the hash of "hello world". – JB Nizet Jun 10 '15 at 16:51
  • yea you definitely need a reality check sometime, but feel free to live in your internet forum world for now ... – specializt Jun 10 '15 at 19:05
  • @specializt , JBNizet Guys, is this still related to my post? If not, could you duke this out somewhere else, please? Chat, maybe? – tobias_k Jun 10 '15 at 19:18
  • I'm not "duking out" anything. – specializt Jun 11 '15 at 08:46

Convert each character to a hex value:

// Requires: import java.util.stream.Collectors;
// Zero-pad each UTF-16 code unit to 4 hex digits so the result is unambiguous.
String hex = str.chars()
  .mapToObj(c -> String.format("%04x", c))
  .collect(Collectors.joining());
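
The variant suggested in the comments under the question (UTF-8 bytes, then hex) can be sketched as follows. The class `Utf8Hex` and its methods `toHex` and `fromHex` are hypothetical names. The decoder is only there to show that the mapping is reversible, which is what makes the hex value unique per string.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Hex {
    // Encodes the string's UTF-8 bytes as two lowercase hex digits per byte.
    public static String toHex(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Reverses toHex; assumes well-formed, even-length hex input.
    public static String fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String hex = toHex("Hello");
        System.out.println(hex);          // 48656c6c6f
        System.out.println(fromHex(hex)); // Hello
    }
}
```

Unlike a digest, the output grows with the input (two hex digits per byte). One caveat: a string containing unpaired surrogates is not valid Unicode, and `getBytes` substitutes a replacement byte for it, so the uniqueness guarantee only holds for well-formed strings.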
Bohemian
  • That won't compile, and it wouldn't necessarily produce unique strings if it compiled, because no left-padding is used. For example, the chars [1, 0] would produce the same value as the chars [16] – JB Nizet Jun 10 '15 at 11:50
  • @jbn yes, I realised the padding thing as soon as I wrote it. I've changed it to string format hex with width. And the compile thing - well yeah... I think I fixed that too. (A little busy putting kids to bed :/) – Bohemian Jun 10 '15 at 12:04