How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?

Question

I am wondering how to generate a GUID given an input string, such that the same input string results in the same GUID (sort of like an MD5 hash). The problem with MD5 hashes is they just guarantee low collision rate, rather than uniqueness. Instead I would like something like this:

guid('v1.0.0') == 1231231231123123123112312312311231231231
guid('v1.0.1') == 6154716581615471658161547165816154716581
guid('v1.0.2') == 1883939319188393931918839393191883939319

How would you go about implementing this sort of thing (ideally in JavaScript)? Is it even possible to do? I am not sure where to start. Things like the uuid module don't take a seed string, and they don't let you use a custom format/alphabet.

I am not looking for the canonical UUID format, but rather a GUID, ideally one made up of just integers.

What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 40 digit long strings (such as "123123..."). As you note, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions. — Peter O., Aug 20 '20 at 22:08
What's the expected max size of your input string? Your examples show short strings of 6 characters. Reason I ask is that you might just be able to use the underlying byte string as the corresponding unique BigInt number. — Trentium, Aug 20 '20 at 23:02
In that case, you can't generally map 120-character strings one-to-one onto 40-digit-long strings, without creating duplicates, due to the pigeonhole principle. — Peter O., Aug 21 '20 at 04:17

Peter O. · Answer 1 · 2020-10-11T19:37:55.973

What you would need is define a one-to-one mapping of text strings (such as "v1.0.0") onto 40 digit long strings (such as "123123..."). This is also known as a bijection, although in your case an injection (a simple one-to-one mapping from inputs to outputs, not necessarily onto) may be enough. As you note, hash functions don't necessarily ensure this mapping, but there are other possibilities, such as full-period linear congruential generators (if they take a seed that you can map one-to-one onto input string values), or other reversible functions.

However, if the set of possible input strings is larger than the set of possible output strings, then you can't map all input strings one-to-one with all output strings (without creating duplicates), due to the pigeonhole principle.

For example, you can't generally map all 120-character strings one-to-one with all 40-digit strings unless you restrict the format of the 120-character strings in some way. However, your problem of creating 40-digit output strings can be solved if you can accept limiting input strings to no more than 10⁴⁰ values (about 132 bits), or if you can otherwise exploit redundancy in the input strings so that they are guaranteed to compress losslessly to 40 decimal digits (about 132 bits) or less, which may or may not be possible. See also this question.

The algorithm involves two steps:

First, transform the string to a BigInt by building up the string's charCodeAt() values similarly to the stringToInt method given in another answer. Throw an error if any charCodeAt() is 0x80 or greater, or if the resulting BigInt is equal to or greater than BigInt(alphabet_length)**BigInt(output_length).
Then, transform the integer to another string by taking the mod of the BigInt and the output alphabet's size and replacing each remainder with the corresponding character in the output alphabet, until the BigInt reaches 0.

There will be no more than 10^40 strings, in fact, there will probably only be thousands of strings, maybe millions. But at the same time, I wanted to solve the problem just in case the number got up to 10^40, which is about 256^16, which is similar in size to a UUID. Can you show a solution for that? The ideal solution would be like this: `gen(string, alphabet, size)`, where string is the input, alphabet is the characters (in my case 0-9), and size is 40 in my case but could be 10 or 20 in another case, or any small integer. — Lance, Aug 22 '20 at 00:33
Any 8-bit character in the string, or limited even to ASCII characters. This is stuff that would go into a URL so ASCII is cleaner and more useful than every UTF-8/16 character. Or if it's too complicated and tbh, just the original 0-9 alphabet would be perfectly fine, that's all I'm really going to use it for. — Lance, Aug 22 '20 at 01:30
(Re-posted, because time to edit has passed.) I should note that without further information on what characters are allowed in the input string, the function would have to assume that any 16-bit character could appear in the string, which could limit the maximum length of input strings that the function could map without loss. — Peter O., Aug 22 '20 at 01:44
@PeterO.if I understand correctly, it looks like your last edit can leverage https://stackoverflow.com/questions/61093432/encode-a-big-integer-to-base62/61852309#61852309 , which allows one to define the character set for a given number base and subsequently convert to and from base 10.. — Trentium, Aug 22 '20 at 11:29

samanime · Answer 2 · 2020-08-21T08:22:32.877

One approach would be to use the method from that answer:

/*
 * uuid-timestamp (emitter)
 * UUID v4 based on timestamp
 *
 * Created by tarkh
 * tarkh.com (C) 2020
 * https://stackoverflow.com/a/63344366/1261825
 */
const uuidEmit = () => {
  // Get now time
  const n = Date.now();
  // Generate random
  const r = Math.random(); // <- swap this
  // Stringify now time and generate additional random number
  const s = String(n) + String(~~(r*9e4)+1e4);
  // Form UUID and return it
  return `${s.slice(0,8)}-${s.slice(8,12)}-4${s.slice(12,15)}-${[8,9,'a','b'][~~(r*3)]}${s.slice(15,18)}-${s.slice(s.length-12)}`;
};

// Generate 5 UUIDs
console.log(`${uuidEmit()}
${uuidEmit()}
${uuidEmit()}
${uuidEmit()}
${uuidEmit()}`);

And simply swap out the Math.random() call to a different random function which can take your seed value. (There are numerous algorithms out there for creating a seedable random method, so I won't try prescribing a particular one).

Most random seeds expect numeric, so you could convert a seed string to an integer by just adding up the character values (multiplying each by 10^position so you'll always get a unique number):

const stringToInt = str => 
  Array.prototype.slice.call(str).reduce((result, char, index) => result += char.charCodeAt(0) * (10**(str.length - index)), 0);
  
console.log(stringToInt("v1.0.0"));
console.log(stringToInt("v1.0.1"));
console.log(stringToInt("v1.0.2"));

If you want to generate the same extract string every time, you can take a similar approach to tarkh's uuidEmit() method but get rid of the bits that change:

const strToInt = str => 
      Array.prototype.slice.call(str).reduce((result, char, index) => result += char.charCodeAt(0) * (10**(str.length - index)), 0);

const strToId = (str, len = 40) => {
  // Generate random
  const r = strToInt(str);
  // Multiply the number by some things to get it to the right number of digits
  const rLen = `${r}`.length; // length of r as a string
  
  // If you want to avoid any chance of collision, you can't provide too long of a string
  // If a small chance of collision is okay, you can instead just truncate the string to
  //  your desired length
  if (rLen > len) throw new Error('String too long');
  
  // our string length is n * (r+m) + e = len, so we'll do some math to get n and m
  const mMax = 9; // maximum for the exponent, too much longer and it might be represented as an exponent. If you discover "e" showing up in your string, lower this value
  let m = Math.floor(Math.min(mMax, len / rLen)); // exponent
  let n = Math.floor(len / (m + rLen)); // number of times we repeat r and m
  let e = len - (n * (rLen + m)); // extra to pad us to the right length
    
  return (new Array(n)).fill(0).map((_, i) => String(r * (i * 10**m))).join('')
    + String(10**e);
};

console.log(strToId("v1.0.0"));
console.log(strToId("v1.0.1"));
console.log(strToId("v1.0.2"));
console.log(strToId("v1.0.0") === strToId("v1.0.0")); // check they are the same
console.log(strToId("v1.0.0") === strToId("v1.0.1")); // check they are different

Note, this will only work with smaller strings, (probably about 10 characters top) but it should be able to avoid all collisions. You could tweak it to handle larger strings (remove the multiplying bit from stringToInt) but then you risk collisions.

This doesn't generate the _same output_ for the repeatedly calling the function with the same input, does it? The second one is what I'm looking for, but I would like 40-digit long numbers, how to do that as you can't represent those with integers in JS. — Lance, Aug 20 '20 at 15:23
Any fixed-length "number" you almost always want to represent as a string, in any language. It's just your "alphabet" is only 0-9. — samanime, Aug 21 '20 at 08:05
I've added a possible solution for you which should work at least for your described use-case. — samanime, Aug 21 '20 at 08:26
Is there any way to make it handle 40-character (i.e. "long") strings of digits, _without_ risk of collisions? Or is it mathematically/technically impossible? If so, I would love to know why it's impossible. I wasn't sure if your latest solution addressed this :p — Lance, Aug 21 '20 at 20:43
Again, your problem is impossible to solve in the general case because of the [pigeonhole principle](https://en.wikipedia.org/wiki/pigeonhole_principle). However, your problem of creating 40-digit output strings can be solved if you can accept limiting input strings to no more than 10^40 values (about 132 bits), or if you can otherwise exploit redundancy in the input strings so that they are guaranteed to compress losslessly to 40 decimal digits (about 132 bits) or less, which may or may not be possible. — Peter O., Aug 21 '20 at 23:49
As Peter O. said, it's not possible to have it for all cases. If you have a fixed-length output, you either must restrict your maximum input or you risk collisions (i.e., you can't have an infinite number of possible inputs go to a finite number of output). If you accepted an output of any length, you could avoid it. If you restrict your input, you could also do it. But you can't do both at the same time. — samanime, Aug 22 '20 at 07:38

score 1 · Answer 3 · answered Aug 21 '20 at 21:53

I suggest using MD5...

Following the classic birthday problem, all things being equal, the odds of 2 people sharing a birthday out of a group of 23 people is ( see https://en.wikipedia.org/wiki/Birthday_problem )...

For estimating MD5 collisions, I'm going to simplify the birthday problem formula, erring in the favor of predicting a higher chance of a collision...

Note though that whereas in the birthday problem, a collision is a positive result, in the MD5 problem, a collision is a negative result, and therefore providing higher than expected collision odds provides a conservative estimate of the chance of a MD5 collision. Plus this higher predicted chance can in some way be considered a fudge factor for any uneven distribution in the MD5 output, although I do not believe there is anyway to quantify this without a God computer...

An MD5 hash is 16 bytes long, resulting in a range of 256^16 possible values. Assuming that the MD5 algorithm is generally uniform in its results, lets suppose we create one quadrillion (ie, a million billion or 10^15) unique strings to run through the hash algorithm. Then using the modified formula (to ease the collision calculations and to add a conservative fudge factor), the odds of a collision are...

So, after 10^15 or one quadrillion unique input strings, the estimated odds of a hash collision are on par with the odds of winning the Powerball or the Mega Millions Jackpot (which are on order of 1 in ~300,000,000 per https://www.engineeringbigdata.com/odds-winning-powerball-grand-prize-r/ ).

Note too that 256^16 is 340282366920938463463374607431768211456, which is 39 digits, falling within the desired range of 40 digits.

So, suggest using the MD5 hash ( converting to BigInt ), and if you do run into a collision, I will be more than glad to spot you a lottery ticket, just to have a chance to tap into your luck and split the proceeds...

( Note: I used https://keisan.casio.com/calculator for the calculations. )

Lol, +1 for the fun in the question :) I still may consider this, but I need 40 characters, so 39 is too little, though perhaps this does make life easier. — Lance, Aug 22 '20 at 00:22
@LancePollard you can always prepend with zeros to reach 40 digits. At 40 digits, the value will have to be managed as a BigInt anyways, which means if you need to save this computed hash value, for instance in a database, you will have to convert the BigInt to string which will retain any prepended leading zeros. Plus converting an MD5 32 byte hex value to BigInt is as simple as `BigInt( "0x" + MD5_Hash )`. — Trentium, Aug 22 '20 at 00:50

score 1 · Answer 4 · answered Apr 21 '21 at 20:36

While UUID v4 is just used for random ID generation, UUID v5 is more like a hash for a given input string and namespace. It's perfect for what you describe.

As you already mentioned, You can use this npm package:

npm install uuid

And it's pretty easy to use.

import {v5 as uuidv5} from 'uuid';

// use a UUIDV4 as a unique namespace for your application.
// you can generate one here: https://www.uuidgenerator.net/version4
const UUIDV5_NAMESPACE = '...';

// Finally, provide the input and namespace to get your unique id.
const uniqueId = uuidv5(input, namespace);

How to generate a GUID with a custom alphabet, that behaves similar to an MD5 hash (in JavaScript)?

4 Answers4

Linked

Related