26

I would like to generate a code like goo.gl and jsfiddle websites (http://jsfiddle.net/XzKvP/).

I tried different things that give me too large of a guid, a repeating alphanumeric code, etc.

I'm thinking I should be able to generate an alphanumeric code based on the Primary Key in my database table. This way it will be non-repeating? The PK is an auto-incremented integer by 1. But not sure that's how it should be done.

I want the code to look random, but it does NOT have to be. For example, I do NOT want item 1234 in my database to be BCDE and the 1235 item to be BCDF.

Examples:

Notice how the url http://jsfiddle.net/XzKvP/ has a unique 5 character code XzKvP associated to the page. I want to be able to generate the same type of code.

goo.gl does it too: http://goo.gl/UEhtg has UEhtg

How is this done?

capdragon
  • 14,565
  • 24
  • 107
  • 153
  • Have a read of the answer on this page: http://stackoverflow.com/questions/3193000/how-does-tiny-url-work – Ste Apr 24 '12 at 14:24
  • you want to generate a small random alphanumeric code that should be smaller than GUID? no other constrains? – Clueless Apr 24 '12 at 14:25
  • @Clueless: It should be non-repeating and the same length of the primary key in my database. – capdragon Apr 24 '12 at 14:26
  • @sch They may be okay if there aren't any better answers that meet my constraints. If a better answer that meets my constraints does come a long... they will get the credit. – capdragon Apr 24 '12 at 14:45
  • I like the answer given here: http://stackoverflow.com/a/1052896/1121833 – Chad Apr 24 '12 at 14:53
  • @ChadHenderson sounds promising. Can you create the C# version in an answer? – capdragon Apr 24 '12 at 14:56
  • @sch I wouldn't be here if I hadn't tried. This is me asking for help. – capdragon Apr 24 '12 at 15:21

3 Answers3

24

The solutions based on a random substring are no good because the outputs will collide. It may happen prematurely (with bad luck), and it will eventually happen when the list of generated values grows large. It doesn't even have to be that large for the probability of collisions to become high (see birthday attack).

What's good for this problem is a pseudo random permutation between the incrementing ID and its counterpart that will be shown in the URL. This technique guarantees that a collision is impossible, while still generating into an output space that is as small as the input space.

Implementation

I suggest this C# version of a Feistel cipher with 32 bits blocks, 3 rounds and a round function that is inspired by pseudo-random generators.

private static double RoundFunction(uint input)
{
    // Must be a function in the mathematical sense (x=y implies f(x)=f(y))
    // but it doesn't have to be reversible.
    // Must return a value between 0 and 1
    return ((1369 * input + 150889) % 714025) / 714025.0;
}

private static uint PermuteId(uint id)
{
    uint l1=(id>>16)&65535;
    uint r1=id&65535;
    uint l2, r2;
    for (int i = 0; i < 3; i++)
    {
        l2 = r1;
        r2 = l1 ^ (uint)(RoundFunction(r1) * 65535);
        l1 = l2;
        r1 = r2;
    }
    return ((r1 << 16) + l1);
}

To express the permuted ID in a base62 string:

private static string GenerateCode(uint id)
{
    return ToBase62(PermuteId(id));
}

The Base62 function is the same as the previous answer except that is takes uint instead of int (otherwise these functions would have to be rewritten to deal with negative values).

Customizing the algorithm

RoundFunction is the secret sauce of the algorithm. You may change it to a non-public version, possibly including a secret key. The Feistel network has two very nice properties:

  • even if the supplied RoundFunction is not reversible, the algorithm guarantees that PermuteId() will be a permutation in the mathematical sense (wich implies zero collision).

  • changing the expression inside the round function even lightly will change drastically the list of final output values.

Beware that putting something too trivial in the round expression would ruin the pseudo-random effect, although it would still work in terms of uniqueness of each PermuteId output. Also, an expression that wouldn't be a function in the mathematical sense would be incompatible with the algorithm, so for instance anything involving random() is not allowed.

Reversability

In its current form, the PermuteId function is its own inverse, which means that:

PermuteId(PermuteId(id))==id

So given a short string produced by the program, if you convert it back to uint with a FromBase62 function, and give that as input to PermuteId(), that will return the corresponding initial ID. That's pretty cool if you don't have a database to store the [internal-ID / shortstring] relationships: they don't actually need to be stored!

Producing even shorter strings

The range of the above function is 32 bits, that is about 4 billion values from 0 to 2^32-1. To express that range in base62, 6 characters are needed.

With only 5 characters, we could hope to represent at most 62^5 values, which is a bit under 1 billion. Should the output string be limited to 5 characters, the code should be tweaked as follows:

  • find N such that N is even and 2^N is as high as possible but lower than 62^5. That's 28, so our real output range that fits in 62^5 is going to be 2^28 or about 268 million values.

  • in PermuteId, use 28/2=14 bits values for l1 and r1 instead of 16 bits, while being careful to not ignore a single bit of the input (which must be less than 2^28).

  • multiply the result of RoundFunction by 16383 instead of 65535, to stay within the 14 bits range.

  • at the end of PermuteId, recombine r1 and l1 to form a 14+14=28 bits value instead of 32.

The same method could be applied for 4 characters, with an output range of 2^22, or about 4 million values.

What does it look like

In the version above, the first 10 produced strings starting with id=1 are:

cZ6ahF
3t5mM
xGNPN
dxwUdS
ej9SyV
cmbVG3
cOlRkc
bfCPOX
JDr8Q
eg7iuA

If I make a trivial change in the round function, that becomes:

ey0LlY
ddy0ak
dDw3wm
bVuNbg
bKGX22
c0s5GZ
dfNMSp
ZySqE
cxKH4b
dNqMDA
Community
  • 1
  • 1
Daniel Vérité
  • 58,074
  • 15
  • 129
  • 156
  • What happens when the 4 Billion values are reached? Does the output string grow? – capdragon Jul 11 '13 at 14:19
  • @capdragon: It can't grow beyond 2^32-1 values because that's `uint.MaxValue`. `PermuteId()` couldn't be called with a higher value. Now if you needed to exceed that, it should be adapted with `ulong` instead of `uint`, it's feasible but non-trivial. – Daniel Vérité Jul 11 '13 at 14:31
  • You mentioned in another comment that you had done this for postgress. I have since decided this is probably best generated on the database and have posted a separate question for it: http://stackoverflow.com/q/17596948/442580 – capdragon Jul 11 '13 at 15:07
10

You can think of the five-letter code as a number in base-62 notation: your "digits" are 26 lowercase and 26 uppercase letters, and digits from 0 to 9. (26+26+10) digits in total. Given a number from 0 to 62^5 (which equals 916132832) (say, your primary key) you can do the conversion to a five-digit base-62 as follows:

private static char Base62Digit(int d) {
    if (d < 26) {
        return (char)('a'+d);
    } else if (d < 52) {
        return (char)('A'+d-26);
    } else if (d < 62) {
        return (char)('0'+d-52);
    } else {
        throw new ArgumentException("d");
    }
}

static string ToBase62(int n) {
    var res = "";
    while (n != 0) {
        res = Base62Digit(n%62) + res;
        n /= 62;
    }
    return res;
}

private static int Base62Decode(char c) {
    if (c >= '0' && c <= '9') {
        return 52 + c - '0';
    } else if (c >= 'A' && c <= 'Z') {
        return 26 + c - 'A';
    } else if (c >= 'a' && c <= 'z') {
        return c - 'a';
    } else {
        throw new ArgumentException("c");
    }
}

static int FromBase62(string s) {
    return s.Aggregate(0, (current, c) => current*62 + Base62Decode(c));
}

Here is how to generate cryptographically strong random numbers (you need to add a reference to System.Security):

private static readonly RNGCryptoServiceProvider crypto =
    new RNGCryptoServiceProvider();

private static int NextRandom() {
    var buf = new byte[4];
    crypto.GetBytes(buf);
    return buf.Aggregate(0, (p, v) => (p << 8) + v) & 0x3FFFFFFF;
}
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • How would i use this? with a function such as `GetCode()`? – capdragon Apr 24 '12 at 14:32
  • 1
    @capdragon: `string GetCode() {return ToBase62(this.PostId);}` – StriplingWarrior Apr 24 '12 at 14:34
  • @capdragon You would generate a numeric primary key, or take one from one of your database's tables, and *encode* it in base-62 (you cannot use off-the-shelf base-64 because it uses slash `/`). Call `ToBase62(id)` to do the encoding. – Sergey Kalinichenko Apr 24 '12 at 14:34
  • 2
    (+1) But are you sure this is how it's done? Because `ToBase62(1234)` returns t4 and `ToBase62(1235)` returns t5. It doesn't LOOK random like these other sites will generate a code that is totally different than the previous. – capdragon Apr 24 '12 at 14:43
  • @capdragon Other sites may be generating the numbers that go into their IDs out of sequence, and start their numbers at 62^4+1 to make sure there's at least five digits. You could use a random number generator for your IDs, and simply check if an ID that you've just generated already exists. With a good PRNG, you should rarely get a collision. – Sergey Kalinichenko Apr 24 '12 at 14:46
  • @dasblinkenlight I see... I'm scared once I have a million or more records I could be checking for duplicates for a while. Is this a valid concert with a 5 digit random character generator? – capdragon Apr 24 '12 at 14:54
  • @capdragon `62^4` equals `14776336` - the smallest number that requires that requires five nase-62 digits. It is similar to 10000 of the decimal system: you can express any positive number smaller than it using only four digits. – Sergey Kalinichenko Apr 24 '12 at 14:56
  • @capdragon You can easily load a million numbers into memory to have the check go faster. However, in order to stay efficient, you need a good random number generator. In .NET that could be [`RNGCryptoServiceProvider`](http://msdn.microsoft.com/en-us/library/system.security.cryptography.rngcryptoserviceprovider.aspx): you would use `GetBytes` to construct a sequence of random bytes, then combine them into a four-byte integer, and chop off the highest two bits. The values you'd get would be *cryptographically strong*, much harder to guess than what you get from `System.Random`. – Sergey Kalinichenko Apr 24 '12 at 15:05
  • Thanks for the example. I'm not convinced of doing the random then checking ifExists on the current values. I would think eventually you'll hit a point where the code generator will be looping for a while looking for a code that has not been used. Though I DO like your ToBase62 method which I might just end up going with unless someone can propose a better solution. – capdragon Apr 24 '12 at 15:37
  • You can do some kind of xor and permutations on your id before calling `ToBase62`. Do it in reverse to get your id back. – Nicolas Repiquet Apr 24 '12 at 15:37
  • @dasblinkenlight Is there a way to mix [this](http://stackoverflow.com/questions/1051949/map-incrementing-integer-range-to-six-digit-base-26-max-but-unpredictably/1052896#1052896) solution with yours so that the results won't be so predictable? – capdragon Apr 24 '12 at 15:37
  • @capdragon If you use `RNGCryptoServiceProvider`, results will not be predictable at all, and you will see a collision less than once in 200. – Sergey Kalinichenko Apr 24 '12 at 15:43
  • @dasblinkenlight: But don't you think still the `ToBase62` is a better solution because there will be NO collisions and NO need to keep all codes in memory to check if they exists? – capdragon Apr 24 '12 at 15:45
  • @capdragon The idea is to call `ToBase62` on the number returned by `NextRandom()`. If you simply shuffle bits in an autoincremented number, your shuffle will be susceptible to an attack that guesses your shuffle algorithm; this cannot be done to a sequence of cryptographically strong randoms. – Sergey Kalinichenko Apr 24 '12 at 15:50
  • @dasblinkenlight Well since i'm not concerned with a shuffle algorithm attack and I do NOT want to keep a collision tracker in memory. I'll go with your `ToBase62` example. I was just hoping i could make a code that looked random and of the same length as my PK. – capdragon Apr 24 '12 at 16:29
  • 2
    @dasblinkenlight If I'm using ToBase62, I might as well use my PK in the database. It's just as predictable, just shorter. I NEED something to **look Random without the use of a collision tracker**. Any other ideas? It does not have to be TRUELY random. It could just be shuffled. – capdragon Apr 24 '12 at 16:36
  • @dasblinkenlight Check out what I ended up doing: http://stackoverflow.com/a/10302978/442580 – capdragon Apr 24 '12 at 17:31
  • Is it a good practice to ditch the numeric ID altogether, and only store the **Base 62** string as the **primary key** in the database? – Dor Nov 09 '12 at 16:57
3

This is what I ended up doing

(Updated since Daniel Vérité's answer):

class Program
{

    private static double RoundFunction(uint input)
    {
        // Must be a function in the mathematical sense (x=y implies f(x)=f(y))
        // but it doesn't have to be reversible.
        // Must return a value between 0 and 1
        return ((1369 * input + 150889) % 714025) / 714025.0;
    }
    private static char Base62Digit(uint d)
    {
        if (d < 26)
        {
            return (char)('a' + d);
        }
        else if (d < 52)
        {
            return (char)('A' + d - 26);
        }
        else if (d < 62)
        {
            return (char)('0' + d - 52);
        }
        else
        {
            throw new ArgumentException("d");
        }
    }
    private static string ToBase62(uint n)
    {
        var res = "";
        while (n != 0)
        {
            res = Base62Digit(n % 62) + res;
            n /= 62;
        }
        return res;
    }
    private static uint PermuteId(uint id)
    {
        uint l1 = (id >> 16) & 65535;
        uint r1 = id & 65535;
        uint l2, r2;
        for (int i = 0; i < 3; i++)
        {
            l2 = r1;
            r2 = l1 ^ (uint)(RoundFunction(r1) * 65535);
            l1 = l2;
            r1 = r2;
        }
        return ((r1 << 16) + l1);
    }


    private static string GenerateCode(uint id)
    {
        return ToBase62(PermuteId(id));
    }

    static void Main(string[] args)
    {

        Console.WriteLine("testing...");

            try
            {

                for (uint x = 1; x < 1000000; x += 1)
                {
                    Console.Write(GenerateCode(x) + ",");

                }

            }
            catch (Exception err)
            {
                Console.WriteLine("error: " + err.Message);
            }

        Console.WriteLine("");
        Console.WriteLine("Press 'Enter' to continue...");
        Console.Read();
    }
}
capdragon
  • 14,565
  • 24
  • 107
  • 153
  • I'm glad that it works for you. I'm curious how do you get the ID back from the string, or was it even a requirement (I assumed that it was). – Sergey Kalinichenko Apr 24 '12 at 17:44
  • @capdragon: the unique ID, as it is used, doesn't prevent collisions. I believe this generator will start to create collisions with a real serious probability after a few million outputs. In fact even the 2nd generated string could collide with the 1st given an extremely unlucky random and shuffle combination. – Daniel Vérité Jul 09 '13 at 13:48
  • 1
    @DanielVérité Thanks for pointing this out, I believe you are right. Would you happen to have a solution? – capdragon Jul 10 '13 at 15:44
  • @capdragon: yes, I have researched this before and came up with a function in plpgsql (postgres) that could be ported to C#. I'll give it a shot and update shortly. – Daniel Vérité Jul 10 '13 at 16:35
  • @capdragon: ok, I've submitted a detailed answer! – Daniel Vérité Jul 11 '13 at 11:50
  • if you want something deterministic you can run the ID through a hash function instead of using a random number. – Justin L. Jul 11 '13 at 11:53
  • 2
    @JustinL.: a hash function produces collisions, which is unacceptable here. Or if you have a collision-less hash function that fits the question, please submit an answer. – Daniel Vérité Jul 11 '13 at 12:09
  • @DanielVérité sorry, submitted comment to wrong answer; thought this was the answer where they appended a base62 representation at the end of a random number – Justin L. Jul 11 '13 at 17:18