3

I'm trying to generate some tokens. They have to be 26 characters along from the alphabet [a-z0-9].

The closest solution I've found is part 2 from this answer, but the string won't be uniformly distributed.

If my alphabet was a power of 2 in length, this wouldn't be so hard, but as it stands, I'm not sure how to do this properly.


Specifically, this is what I've got so far:

export function createSessionId() {
    const len = 26;
    let bytes = new Crypto.randomBytes(len);
    let result = new Array(len);
    const alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
    for(let i=0; i<len; ++i) {
        result[i] = alphabet[bytes[i]%alphabet.length];
    }
    return result.join('');
}

But I'm pretty sure that won't be distributed properly because of the modulo.

mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • What about a) using a random float and multiplying with 36 or b) random(64) and ignoring all results above 35? (any by random i mean using the random bytes to interpret them as float) – Psi Jun 15 '17 at 21:21
  • @Psi Would those be 'secure'? I could do that. – mpen Jun 15 '17 at 21:23
  • If you included uppercase letters and 2 extraneous characters in the mix, you'd have a power of 2. Do the tokens absolutely have to be base36? – Patrick Roberts Jun 15 '17 at 21:23
  • 1
    Well, the distribution in both cases would be nearly even. Better than modulo 36 at least. A perfect distribution would be hard because of the discrete nature even of a float representation, but using that approach, you would be very close to chance i bet. The second solution will be evenly distributed, but after you throw away your result nearly 44% of the time, your runtime behavior would be bad as hell, especially in unlucky cases. The best way to find out is benchmarking. – Psi Jun 15 '17 at 21:26
  • @PatrickRoberts I'm interopping with an existing system. That's the alphabet they use. I'd like to stick to the existing conventions. Not sure if changing it would break anything or not. – mpen Jun 15 '17 at 21:30
  • @Psi based on my calculations, you'd only need to discard 1.5% of the values...? `(256 - floor(256 / 36) * 36) / 256` is 0.015625 – Patrick Roberts Jun 15 '17 at 21:30
  • Using modulo 64, you can only use 36 of them, you need to discard 44%. I did not calculate for 128 or even 256. Maybe using a higher number is slightly better, so first using a modulo 252 out of 256 is of course better. – Psi Jun 15 '17 at 21:33

2 Answers2

2

Let's see what you got here:

An alphabet of 36 characters, where each character is randomly selected at once by using modulo 36, is indeed not evenly distributed.

You have several options:

  1. Select a whole series of bytes to represent the total string. That means, you need to select enough bytes to represent a number in the range of 0 - 36^26. That way, your value will be evenly distributed (as far as the crypto provider allows for).

  2. If you insist on choosing one digit per time, you want to make sure its value will be distributed evenly, using modulo 36 does not do the job, as you correctly assumed. In this case, you can either

  • a) interpret 8 bytes as a float and multiply the result with 36
  • b) modulo by a power of 2 greater than 36. Then, search for the greatest multiple of 36 lower than that power of two, discard any value outside of that value space and modulo 36 the valid values.

For 2a), the distribution is not perfectly even, but very close to chance, assumed that the crypto provider is fair.

For 2b), the distribution is even. But this will be paid by a higher (and especially unpredictable) runtime when throwing away unneeded results. Of course you can statistically calculate the runtime, but the worst case is an infinite one, if your RNG keeps producing invalid results forever (which is very, very unlikely, but theoretically possible).

My advice is 2a). Take a series of bytes, interpret them as a float value and multiply the result with 36.

Psi
  • 6,387
  • 3
  • 16
  • 26
  • For 2b, do I need to choose a power of 2 or a multiple of 36? e.g. can't I discard values >= 252 and then % 36? That would give exactly 7 ways to roll each number, 0-35, no? – mpen Jun 15 '17 at 21:49
  • 1
    First, you take the value of one byte. If it is > 251 => discard. After that, you have a value which is in the range 0-251 (which is a perfect multiple of 36) and you can do a modulo 36. It's nearly the same as using dice to render an even distribution on a value space of 4 values. If a die shows a 5 or 6, discard the value, the rest will be evenly distributed. – Psi Jun 15 '17 at 21:50
  • Right, that's what I thought. Your answer says "power of 2" but neither 36 nor 252 are powers of 2. – mpen Jun 15 '17 at 21:52
  • 256 is a power of two. Maybe I was unclear, going to clear out that point – Psi Jun 15 '17 at 21:52
1

Here's my implementation of Psi's answer, 2b.

export function createSessionId() {
    const len = 26;
    let bytes = Crypto.randomBytes(30); // a few extra in case we're unlucky
    let result = new Array(len);
    const alphabet = 'abcdefghijklmnopqrstuvwxyz0123456789';
    let i =0;
    let j = 0;

    for(;;) {
        if(i >= bytes.length) {
            bytes = Crypto.randomBytes(((len-j)*1.2)|0); // we got unlucky, gather up some more entropy
            i = 0;
        }
        let value = bytes[i++];
        if(value >= 252) { // we need a multiple of 36 for an even distribution
            continue;
        }
        result[j++] = alphabet[value % alphabet.length];
        if(j >= len) {
            break;
        }
    }
    return result.join('');
}

There's less than a 2% chance of needing to reroll (4/255), so I think this ought to be efficient enough.

Hard to unit test something like this, but this passes:

test(createSessionId.name, () => {
    let ids = new Set();
    let dist = Array.apply(null,new Array(26)).map(() => ({}));
    for(let i=0; i<1500; ++i) {
        let id = createSessionId();
        if(ids.has(id)) {
            throw new Error(`Not unique`);
        }
        ids.add(id);
        for(let j=0; j<id.length; ++j) {
            dist[j][id[j]] = true;
        }
    }
    for(let i=0; i<26; ++i) {
        expect(Object.keys(dist[i]).length).toEqual(36);
    }
});
mpen
  • 272,448
  • 266
  • 850
  • 1,236
  • Statistically, you're right, but the worst case is an infinite series of values above 251 where you are re-rolling forever. Although the probability for that is infinitesimally small, it is not 0. – Psi Jun 23 '17 at 01:33
  • @Psi Isn't it though? [Wolfram|Alpha says](http://www.wolframalpha.com/input/?i=x%2Finfinity) `x/infinity` *is* 0. Also, in 1 million runs of the function, it only had to reroll 101 times (meaning it took more than 30 bytes). i.e., there's a 0.01% chance of having to generate a few extra bytes. Isn't 4 or 5 extra iterations better than generating an uneven distribution with floats? Also, the float math is probably more expensive anyway. – mpen Jun 23 '17 at 20:43
  • You should read about an [infinitesimal](https://en.wikipedia.org/wiki/Infinitesimal). Infinity is a concept, not a number, so dividing by infinity makes nearly as much sense as dividing by zero. Nonetheless, even said that the chance actually _was_ zero, the chance for re-rolling again over a period of, say, one year, in fact (and you can't deny it) _is_ strictly above zero. It is very very unlikely to happen, but theoretically possible. And I'm just talking about the _worst case scenario_, not the expected average runtime. Your algorithm may run totally fine a billion of times. – Psi Jun 23 '17 at 22:08
  • 1
    The next thing you should consider is calculating the probability that in a series of, say, 1 million runs, no delay longer than a given maximum of re-rolls _x_ is likely to occur. You will find that this value is much higher than expected due to the [birthday paradox](https://en.wikipedia.org/wiki/Birthday_problem). So a, let's call it _bad case_ or even _very bad case_ becomes more probable the more often your algorithm gets executed because it is unlikely that your number of needed re-rolls always will be below the threshold _x_ the higher the number of executions is. – Psi Jun 23 '17 at 22:19