1

From a given string I am generating 32 digit unique hash code using MD5

    MessageDigest.getInstance("MD5")
             .digest("SOME-BIG-STRING").map("%02x".format(_)).mkString

    //output: 47a8899bdd7213fb1baab6cd493474b4

Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?

Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability?

Security is not important, uniqueness is required.

Peter O.
  • 32,158
  • 14
  • 82
  • 96
Sumeet Kumar Yadav
  • 11,912
  • 6
  • 43
  • 80
  • Does this answer your question? [Hash function to produce a code of 30 chars?](https://stackoverflow.com/questions/3536617/hash-function-to-produce-a-code-of-30-chars) – Pat. ANDRIA Dec 11 '20 at 12:40
  • 1
    md5 is insecure. There are ways to generate colliding inputs on purpose. If your code relies on md5 to not produce collisions, that may be unwarranted. Especially if there is user-supplied data involved in any way. –  Dec 11 '20 at 12:44
  • I have 1 trillion unique strings and want to generate unique hash corresponding to each string . What could be best way to do so ? – Sumeet Kumar Yadav Dec 11 '20 at 12:49

2 Answers2

1

Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?

  1. Yes.

  2. You increase the probability of a collision by a factor of 28.

Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability ?

Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.

Security is not important, uniqueness is required ?

In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason that MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to find a second input for a given MD5 hash.)

However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto strength hash function that generates N bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2N. For large enough values of N, the probability is very small. But it is never zero.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
1

For generating unique IDs from strings, hash functions are never the correct answer.

What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case a injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.

However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.

See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.

Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.

But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.


If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.

This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).

Peter O.
  • 32,158
  • 14
  • 82
  • 96
  • I have a 1 trillion distinct string and each string has 800 length long . For each string i want to generate unique hash of 30 length . – Sumeet Kumar Yadav Dec 11 '20 at 14:21
  • In that case, you can't generally map all 800-character strings one-to-one with all 30-digit strings (by the pigeonhole principle) unless you restrict the format of the 800-character strings in some way. You will have to somehow exploit redundancy in your 800-character input strings so that they are guaranteed to compress losslessly to 30 decimal digits, and this may or may not be possible. However, you _can_ map 1 trillion unique values comfortably into strings of 14 or more decimal digits each. What is the format of your input strings? – Peter O. Dec 11 '20 at 14:26
  • (Cont'd.) And can your application tolerate the risk of producing the same ID for different input strings? – Peter O. Dec 11 '20 at 14:29
  • 47a8899bdd7213fb1baab6cd493474b4-name-47a8899bdd7213fb1baab6cd493474b4........ – Sumeet Kumar Yadav Dec 11 '20 at 14:30
  • -NAME-- like this 10 time is each string and application can not take risk of same id twice – Sumeet Kumar Yadav Dec 11 '20 at 14:32
  • UUID is data base UUID generated 32 digit long – Sumeet Kumar Yadav Dec 11 '20 at 14:32
  • In that case, if you can't restrict the format of your input strings to 30-digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs. – Peter O. Dec 11 '20 at 14:34
  • How does database UUID generate a unique every time ? – Sumeet Kumar Yadav Dec 11 '20 at 14:38
  • If i use hash function like md5 what will be my chances of collision in my use case ? – Sumeet Kumar Yadav Dec 11 '20 at 14:39
  • The chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there); roughly speaking, the chance of _accidental_ collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion. But MD5 and other hash functions are not the right way to do what you want to do; I've edited my answer. – Peter O. Dec 11 '20 at 14:42
  • https://en.wikipedia.org/wiki/Universally_unique_identifier <- UUID generation can involve a fingerprint of the generating machine, a timestamp and/or random data, exact details depend on the database in question –  Dec 11 '20 at 14:52