7

This has been asked numerous times here on SO. But I haven't found a solution for my problem.

I want to create a short hash (let's say max 8 chars) for an invitation system. I cannot use base[X] encoding because that would be too easy to guess. I cannot just trim extra characters of e.g. an MD5 hash, because I think the problem of collisions will come up at some time then.

Is there a solution for this?

Aurelio De Rosa
  • 21,856
  • 8
  • 48
  • 71
PeeHaa
  • 71,436
  • 58
  • 190
  • 262
  • 3
    If you are using a hash, then there is always the possibility of a collision. With 8 characters of an MD5 hash, that a 1 in 4 billion chance of a collision (assuming the input data is random). – Oliver Charlesworth Dec 04 '11 at 16:27
  • A string lenght of 8 as well as 32, 64 and so on, will be always limited. – Aurelio De Rosa Dec 04 '11 at 16:27
  • Is 40 characters really too long? `md5()`, despite the "chance" of collision should suffice I would think. Besides performing a look-up for uniqueness I can't see a flawless system, especially for 8 characters. – Dan Lugg Dec 04 '11 at 16:30
  • 1
    Short hashes inevitably come with increased collision probabilities. – wnrph Dec 04 '11 at 16:35
  • I believe the #1 solution in similar questions has been that you should generate a random string and check that it's unique before saving it to a database. If it's not unique, regenerate and try again. – JJJ Dec 04 '11 at 16:41
  • @OliCharlesworth: yeah but if you trim the hash the possiblity will become a lot greater – PeeHaa Dec 04 '11 at 17:09
  • @PeeHaa: You say "8 chars" in your question; that is a 1 in 4 billion chance. 4 chars would be 1 in 65536 chance. This is true, no matter what hash function you choose. – Oliver Charlesworth Dec 04 '11 at 17:11
  • @OliCharlesworth that's only the case if the numbers would be really random. I don't think md5 is really random in that the chance of hitting a 1 for the first character is a big as hitting a 2. I miht be wrong though – PeeHaa Dec 04 '11 at 17:13
  • @PeeHaa: MD5 isn't random at all. Perhaps you mean "uniform"? – Oliver Charlesworth Dec 04 '11 at 17:14
  • @OliCharlesworth: no i mean random. You said the chance of a collision is 1 in 4 billion. However that is only the case if the numbers would be truly random. – PeeHaa Dec 04 '11 at 17:15
  • 2
    @PeeHaa: MD5 is a deterministic algorithm; it introduces no randomness. Whether or not your output numbers are random depends on what input you give it. Assuming your input is random and uniformly-distributed, then the output of a (truncated) MD5 hash will be random and uniformly-distributed. – Oliver Charlesworth Dec 04 '11 at 17:16

4 Answers4

2

The shortest useful hash algorithm would be md5. Md5 generates 16 bytes=128 bit hash. if you use base 64 encoding, that is, 6 useful bits per byte/char.

You should be able to reduce the md5 to 22 characters (leaving the trailing padding introduced by b64).

This has an added advantage of using the same for legal filenames. You will have to substitute the default / and + characters with any other symbol which does not clash with file naming convention of your os.

Base64 (by replacing / and +) ensures your hash does not mess up the url with special characters.

iwein
  • 25,788
  • 10
  • 70
  • 111
geekay
  • 340
  • 1
  • 5
2

If you want to be assured of never having a collision, your best bet is to maintain a database of valid hashes and compare against that database when generating new hashes.

If you think you will have a high volume, you may want to pre-generate the hashes so that you have a "haystack" of them ready to use. Some people do this with random numbers because hardware random number generators can only produce numbers at a certain rate.

user984869
  • 432
  • 2
  • 8
1

If you want your invitation code to be unique (100% safe from collisions) and hard to guess at the same time, you can make it up of two parts, one being unique and the other being hard to guess. It will not be a hash per se, but it will look cryptic enough for the recipient.

// Thanks to https://stackoverflow.com/questions/4356289/php-random-string-generator
function generateRandomString($length, $characters)
{
    $charactersLength = strlen($characters);
    $randomString = '';
    for ($i = 0; $i < $length; $i++) {
        $randomString .= $characters[rand(0, $charactersLength - 1)];
    }
    return $randomString;
}

function generateUniqueHardToGuessCode($length, $id)
{
    $allowedCharacters = '123456789ABCDEFGHJKLMNPQRSTUVWXYZ';
    // exclude 0, O and I from allowed characters to avoid confusion:
    // 0/O and I/l pairs can look very similar in some fonts like Arial
    $uniquePart = strtoupper(base_convert($id, 10, 32));
    // base_convert(.., .., 32) returns the string of the following
    // 32 characters: "0123456789abcdefghijklmnopqrstuv"
    // "wxy" characters are left off to replace "0OI" we want to exclude,
    // "z" character will serve as a separator between random and unique
    // parts to prevent situations when shorter unique part combined
    // with random characters happens to match the longer unique part
    // of another code, e.g.:
    // ABC (unique) + DEFG (random) = ABCD (unique) + EFG (random)
    $uniquePart = strtr($uniquePart, '0OI', 'WXY');
    $randomPartLength = $length - strlen($uniquePart) - 1; // 1 for separator
    if ($randomPartLength < 1) {
        throw new Exception("The length of $length characters is not enough to create hard to guess code for ID $id");
    }
    $randomPart = generateRandomString($randomPartLength, $allowedCharacters);
    return $randomPart . 'Z' . $uniquePart;
}

for ($id = 0; $id < 10; $id++) {
    echo generateUniqueHardToGuessCode(8, $id), PHP_EOL;
}

The above snippet will output invitation codes like this:

 A33UAEZW
 DCBY6EZ1
 985Z17Z2
 REBYBTZ3
 XLLRGTZ4
 AEP5WBZ5
 UKQNGNZ6
 CTHRTXZ7
 CRTAWKZ8
 GJB9PXZ9

If you want them to appear even more random, including last digits, you can pre-generate a pool of them as @user984869 suggested.

Please note the exception this snippet throws when the desired code length is not enough to contain both parts. It is inevitable if we want the length to be fixed. Fixed length also makes invitation codes with longer unique parts easier to guess because of shorter random parts.

That is why I would prefer a fixed length random part and dynamically growing unique part:

function generateUniqueHardToGuessCode($randomPartLength, $id)
{
    $allowedCharacters = '123456789ABCDEFGHJKLMNPQRSTUVWXYZ';
    // exclude 0, O and I from allowed characters to avoid confusion:
    // 0/O and I/l pairs can look very similar in some fonts like Arial
    $uniquePart = strtoupper(base_convert($id, 10, 33));
    // base_convert(.., .., 33) will return the string of the following
    // 33 characters: "0123456789abcdefghijklmnopqrstuvw"
    // "xyz" characters are left off to replace "0OI" characters
    // we want to exclude.
    $uniquePart = strtr($uniquePart, '0OI', 'XYZ');
    $randomPart = generateRandomString($randomPartLength, $allowedCharacters);
    return $randomPart . $uniquePart;
}

It makes invitation codes slowly grow in size as $id gets bigger, but throws no exceptions. It also saves an extra character by making the separator unnecessary.

vvzh
  • 391
  • 5
  • 8
1

You can use substr on a SHA1 or MD5. The chance of a collision with a substr'd hash is the same as a hash that's designed to be the shorter length.

Or if all you really want is to generate a unique key, you can do something like this:

define('KEY_CHARS', 'acefghjkpqrstwxyz23456789'); // characters which cannot be confused phonetically or by bad handwriting

function generateKey($len = 8) {
    $k = str_repeat('.', $len);
    while ($len--) {
        $k[$len] = substr(KEY_CHARS, mt_rand(0, strlen(KEY_CHARS) - 1), 1);
    }
    return $k;
}
Boann
  • 48,794
  • 16
  • 117
  • 146
  • @OliCharlesworth: It's random? Gotta be better than the old `md5(rand().uniqid().time())` code one sees thrown around. – Boann Dec 04 '11 at 16:45
  • 1
    @Boann: If you want unique, the only way to achieve this is to store details about previously-generated IDs. – Oliver Charlesworth Dec 04 '11 at 16:50
  • @Juhana @OliCharlesworth: Hmmm, okay. OP could add an incrementing integer to it each time he generates an invitation. E.g., `$invitationKey = generateKey() . '.' . (++$invitationNumber)`. Then it's both unguessable and unique. – Boann Dec 04 '11 at 16:52