Hash Function MD5 30 length

Question

From a given string I am generating 32 digit unique hash code using MD5

    MessageDigest.getInstance("MD5")
             .digest("SOME-BIG-STRING").map("%02x".format(_)).mkString

    //output: 47a8899bdd7213fb1baab6cd493474b4

Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?

Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability?

Security is not important, uniqueness is required.

Does this answer your question? [Hash function to produce a code of 30 chars?](https://stackoverflow.com/questions/3536617/hash-function-to-produce-a-code-of-30-chars) — Pat. ANDRIA, Dec 11 '20 at 12:40
md5 is insecure. There are ways to generate colliding inputs on purpose. If your code relies on md5 to not produce collisions, that may be unwarranted. Especially if there is user-supplied data involved in any way. — , Dec 11 '20 at 12:44
I have 1 trillion unique strings and want to generate unique hash corresponding to each string . What could be best way to do so ? — Sumeet Kumar Yadav, Dec 11 '20 at 12:49

Stephen C · Answer 1 · 2020-12-11T13:09:54.910

1

Is it possible to generate 30 digit long instead of 32 digit and what will be problem if it do so?

Yes.
You increase the probability of a collision by a factor of 2⁸.

Any another hash algorithm to use to support 30 character long and 1 trillion unique strings collision probability ?

Probably. Taking the first 30 hex digits of a hash produced by any crypto-strength hash algorithm has roughly equivalent uniqueness properties.

Security is not important, uniqueness is required ?

In that case, the fact that MD5 is no longer considered secure is moot. (Note that the reason that MD5 is no longer considered secure is that it is computationally feasible to engineer a collision; i.e. to find a second input for a given MD5 hash.)

However, uniqueness of hashes cannot be guaranteed. Even with a "perfect" crypto strength hash function that generates N bit hashes, the probability of a collision for any 2 arbitrary (different) inputs is one in 2^N. For large enough values of N, the probability is very small. But it is never zero.

edited Dec 11 '20 at 13:09

answered Dec 11 '20 at 13:02

Stephen C

698,415
94
811
1,216

So if I generate 30 digit md5 hash then probability of collision is 1/ 2^30 ? Am i correct – Sumeet Kumar Yadav Dec 11 '20 at 13:10
No. A hex digit is 4 bits. 1 in 2^120 – Stephen C Dec 11 '20 at 13:18
Probably same probability could be happening with UUID https://stackoverflow.com/questions/24876188/how-big-is-the-chance-to-get-a-java-uuid-randomuuid-collision – Sumeet Kumar Yadav Dec 11 '20 at 13:23
1

I am not aware of one. But as I said, it is trivial to implement. Use a regular MD5 to generate hash, and discard 2 hex digits / 8 bits / 1 byte. – Stephen C Dec 11 '20 at 13:23
Like birthday paradox problem , can you help me find in how many records it will have at-least one collision ? – Sumeet Kumar Yadav Dec 11 '20 at 13:33
Try reading the wikipedia page on the birthday paradox. – Stephen C Dec 11 '20 at 13:35

Peter O. · Accepted Answer · 2021-03-25T00:10:04.593

1

For generating unique IDs from strings, hash functions are never the correct answer.

What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 30-character-long strings (such as "123123..."). This is also known as a bijection, although in your case a injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As the other answer at the time of this writing notes, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.

However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.

See also this question: How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?.

Indeed, if you use hash functions, the chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there). If we take MD5 as an example (which produces any of 2^128 hash codes), then roughly speaking, the chance of accidental collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion.

But MD5 and other hash functions are not the right way to do what you want to do. This is discussed next.

If you can't restrict the format of your input strings to 30 digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs.

This database table should have two columns: one column holds your input strings (e.g., "<UUID>-NAME-<UUID>"), and the other column holds randomly generated IDs associated with those strings. Since random numbers don't ensure uniqueness, every time you create a new random ID you will need to check whether the random ID already exists in the database, and if it does exist, try a new random ID (but the chance that a duplicate is found will shrink as the size of the ID grows).

edited Mar 25 '21 at 00:10

answered Dec 11 '20 at 14:02

Peter O.

32,158
14
82
96

I have a 1 trillion distinct string and each string has 800 length long . For each string i want to generate unique hash of 30 length . – Sumeet Kumar Yadav Dec 11 '20 at 14:21
In that case, you can't generally map all 800-character strings one-to-one with all 30-digit strings (by the pigeonhole principle) unless you restrict the format of the 800-character strings in some way. You will have to somehow exploit redundancy in your 800-character input strings so that they are guaranteed to compress losslessly to 30 decimal digits, and this may or may not be possible. However, you _can_ map 1 trillion unique values comfortably into strings of 14 or more decimal digits each. What is the format of your input strings? – Peter O. Dec 11 '20 at 14:26
(Cont'd.) And can your application tolerate the risk of producing the same ID for different input strings? – Peter O. Dec 11 '20 at 14:29
47a8899bdd7213fb1baab6cd493474b4-name-47a8899bdd7213fb1baab6cd493474b4........ – Sumeet Kumar Yadav Dec 11 '20 at 14:30
-NAME-- like this 10 time is each string and application can not take risk of same id twice – Sumeet Kumar Yadav Dec 11 '20 at 14:32
UUID is data base UUID generated 32 digit long – Sumeet Kumar Yadav Dec 11 '20 at 14:32
In that case, if you can't restrict the format of your input strings to 30-digits and can't compress them to 30 digits or less and can't tolerate the risk of duplicates, then the next best thing is to create a database table mapping your input strings to randomly generated IDs. – Peter O. Dec 11 '20 at 14:34
How does database UUID generate a unique every time ? – Sumeet Kumar Yadav Dec 11 '20 at 14:38
If i use hash function like md5 what will be my chances of collision in my use case ? – Sumeet Kumar Yadav Dec 11 '20 at 14:39
The chance of collision will be close to zero but never exactly zero (meaning that the risk of duplicates will always be there); roughly speaking, the chance of _accidental_ collision becomes non-negligible only after 2^64 IDs are generated, which is well over 1 trillion. But MD5 and other hash functions are not the right way to do what you want to do; I've edited my answer. – Peter O. Dec 11 '20 at 14:42
https://en.wikipedia.org/wiki/Universally_unique_identifier <- UUID generation can involve a fingerprint of the generating machine, a timestamp and/or random data, exact details depend on the database in question – Dec 11 '20 at 14:52

Hash Function MD5 30 length

2 Answers2

Linked