130
  • Need to handle > 1000 but < 10000 new records per day

  • Cannot use GUID/UUIDs, auto increment numbers etc.

  • Ideally should be 5 or 6 chars long, can be alpha of course

  • Would like to reuse existing, well-known algos, if available

Anything out there ?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Kumar
  • 10,997
  • 13
  • 84
  • 134
  • Why not use an INT or BIGINT that is autoincremented? It is probably the most readable and can easily handle the volume. – Malk Mar 03 '12 at 05:32
  • per the Q above, trying to keep it to 5/6 chars max and support upto 9999 new records a day – Kumar Mar 03 '12 at 05:37
  • @Kumar - What if you need to more than 9999 records in one day? Your proposed solution does not sound tenable. – ChaosPandion Mar 03 '12 at 05:39
  • @ChaosPandion: I think these are probably rough guesses of load/traffic rather than hard bounds. I'm not sure why you'd want to set an arbitrary cap on the number of daily transactions. – Paul Sasik Mar 03 '12 at 05:42
  • You could encode it to base 64 and use that. I am not sure you could reduce it smaller than that and still use readable characters. But I would argue that base 64 is far less readable than base 32 because it requires adding an extra qualifier to most characters (capital f, lower o, lower o versus just f, o o). – Malk Mar 03 '12 at 05:44

5 Answers5

173

Base 62 is used by tinyurl and bit.ly for the abbreviated URLs. It's a well-understood method for creating "unique", human-readable IDs. Of course you will have to store the created IDs and check for duplicates on creation to ensure uniqueness. (See code at bottom of answer)

Base 62 uniqueness metrics

5 chars in base 62 will give you 62^5 unique IDs = 916,132,832 (~1 billion) At 10k IDs per day you will be ok for 91k+ days

6 chars in base 62 will give you 62^6 unique IDs = 56,800,235,584 (56+ billion) At 10k IDs per day you will be ok for 5+ million days

Base 36 uniqueness metrics

6 chars will give you 36^6 unique IDs = 2,176,782,336 (2+ billion)

7 chars will give you 36^7 unique IDs = 78,364,164,096 (78+ billion)

Code:

public void TestRandomIdGenerator()
{
    // create five IDs of six, base 62 characters
    for (int i=0; i<5; i++) Console.WriteLine(RandomIdGenerator.GetBase62(6));

    // create five IDs of eight base 36 characters
    for (int i=0; i<5; i++) Console.WriteLine(RandomIdGenerator.GetBase36(8));
}

public static class RandomIdGenerator 
{
    private static char[] _base62chars = 
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        .ToCharArray();

    private static Random _random = new Random();

    public static string GetBase62(int length) 
    {
        var sb = new StringBuilder(length);

        for (int i=0; i<length; i++) 
            sb.Append(_base62chars[_random.Next(62)]);

        return sb.ToString();
    }       

    public static string GetBase36(int length) 
    {
        var sb = new StringBuilder(length);

        for (int i=0; i<length; i++) 
            sb.Append(_base62chars[_random.Next(36)]);

        return sb.ToString();
    }
}

Output:

z5KyMg
wd4SUp
uSzQtH
UPrGAT
UIf2IS

QCF9GNM5
0UV3TFSS
3MG91VKP
7NTRF10T
AJK3AJU7
Paul Sasik
  • 79,492
  • 20
  • 149
  • 189
  • 3
    looks fantastic, anything that's not case sensitive ? – Kumar Mar 03 '12 at 05:41
  • 3
    If you want to avoid case sensitive you could use base 36: http://www.codeproject.com/Articles/10619/Base-36-type-for-NET-C but to get that many permutations as base 62 you would need to use more characters in your ID. It's a trade-off. Or you could try to use other characters besides alpha, but that gets ugly for users. – Paul Sasik Mar 03 '12 at 05:44
  • thanks, from your answer i came across base36 as well, now to find a sql server implementation of it ! – Kumar Mar 03 '12 at 05:50
  • 2
    here http://stackoverflow.com/questions/9543892/any-sql-server-proc-for-base36 & many thanks – Kumar Mar 03 '12 at 05:56
  • 25
    One thought. Perhaps take out the vowels to prevent the accidental generation of swear words. Especially if it's public facing. – Damien Sawyer Oct 30 '15 at 11:58
  • 11
    Depending on where you are using this (particularly if humans are expected to read and re-enter the codes) you may want to consider removing oft-confused characters from consideration: 0/O and I/l/1. This can be mitigated in some cases by good font choice, but I can't tell from the question whether the OP will have control over that. – GrandOpener Apr 19 '16 at 20:16
  • 1
    It may be worth pointing out that the UTF-16 Unicode encoding for an 8 character identifier occupies exactly the same amount of space as a Version 1 UUID represented as a byte array and stored in a database as BINARY(16). If the purpose is to have a unique ID, a version 1 UUID will be more index friendly and it addresses the potential pitfalls of the pseudo-random number generator. – scottb Sep 01 '16 at 14:43
  • 1
    @DamienSawyer +1 That's interesting. Although I bet there's still a bunch of other "colourful" combinations. SH1T, P3N1S, T1TS, B00BS etc. Maybe a profanity filter would be a better choice if the url could cause embarrassment on, for example, a .gov website. – Lee Gunn Dec 14 '16 at 11:44
  • Good point @lee-gunn. I can't help thinking that your list would make for great license plates :-) – Damien Sawyer Dec 14 '16 at 21:09
  • Can you please compare your answer to the answer here https://stackoverflow.com/a/29372036/54964 Do you have any salts? How do you compare anonymity of your code in your answer to the other answer? – Léo Léopold Hertz 준영 May 28 '17 at 07:59
  • Can I use this logic to generate correlation/request Ids? Even if they start repeating after a couple of months, I am good. I know, these won't repeat so fast. I am using java and yes, need case-insensitive ids. – User3518958 Aug 14 '20 at 06:35
  • 3
    As we create more ids, the DB check will find more & more duplicates, resulting in slow performance. So this approach doesn't scale in the long term. – Arun Avanathan Feb 12 '21 at 22:51
  • @ArunAvanathan I don't think short IDs like this are logically suitable for long term storage. They're not a replacement for incrementing integer or UUID or some other cryptographically unique value. These are more suitable for applications where the ID is used by a human, and the record can and should be expired and deleted imo. Like a url shortener or notification system where the notification can be deleted once dismissed by the user. – Sinaesthetic Oct 08 '21 at 17:54
25

I recommend http://hashids.org/ which converts any number (e.g. DB ID) into a string (using salt).

It allows decoding this string back to the number. So you don't need to store it in the database.

Has libs for JavaScript, Ruby, Python, Java, Scala, PHP, Perl, Swift, Clojure, Objective-C, C, C++11, Go, Erlang, Lua, Elixir, ColdFusion, Groovy, Kotlin, Nim, VBA, CoffeeScript and for Node.js & .NET.

Slawa
  • 1,141
  • 15
  • 21
  • 1
    Can you provide any other options similary your proposal? - - It is very interesting. I would like to know if there is any default options like that in PostgreSQL. – Léo Léopold Hertz 준영 May 28 '17 at 07:58
  • 1
    Here is [.NET](https://github.com/ullmark/hashids.net) version of it, but can you explain how does it work without need to store it in the database ? Can I generate just unique randoms without giving numbers as input and without a salt ? – Shaiju T Dec 27 '17 at 11:01
  • @Slawa I need something like hashids for .NET but the final hash will be stored in the db in a column with fixed length, is that possible to say always generate hash with a maximum length of N? – Anon Dev Feb 07 '18 at 21:54
8

I had similar requirements as the OP. I looked into available libraries but most of them are based on randomness and I didn't want that. I could not really find anything that was not based on random and still very short... So I ended up rolling my own based on the technique Flickr uses, but modified to require less coordination and allow for longer periods offline.

In short:

  • A central server issues ID blocks consisting of 32 IDs each
  • The local ID generator maintains a pool of ID blocks to generate an ID every time one is requested. When the pool runs low it fetches more ID blocks from the server to fill it up again.

Disadvantages:

  • Requires central coordination
  • IDs are more or less predictable (less so than regular DB ids but they aren't random)

Advantages

  • Stays within 53 bits (Javascript / PHP max size for integer numbers)
  • very short IDs
  • Base 36 encoded so very easy for humans to read, write and pronounce
  • IDs can be generated locally for a very long time before needing contact with the server again (depending on pool settings)
  • Theoretically no chance of collissions

I have published both a Javascript library for the client side, as well as a Java EE server implementation. Implementing servers in other languages should be easy as well.

Here are the projects:

suid - Distributed Service-Unique IDs that are short and sweet

suid-server-java - Suid-server implementation for the Java EE technology stack.

Both libraries are available under a liberal Creative Commons open source license. Hoping this may help someone else looking for short unique IDs.

Stijn de Witt
  • 40,192
  • 13
  • 79
  • 80
  • Can you please compare the https://stackoverflow.com/a/29372036/54964 to your proposal `suid`? – Léo Léopold Hertz 준영 May 28 '17 at 08:00
  • 2
    It is based on random numbers. It is pretty great, actually. But your IDs wont be as short as they can be. I wrote SUID to start numbering at 1 so you will start out with *extremely short* IDs. Think 3 or 4 characters. Plus, it has some other nice advantages to have (roughly) incrementally ordered IDs, apart from starting with the really short ones. – Stijn de Witt May 30 '17 at 08:56
4

I used base 36 when I solved this problem for an application I was developing a couple of years back. I needed to generate a human readable reasonably unique number (within the current calendar year anyway). I chose to use the time in milliseconds from midnight on Jan 1st of the current year (so each year, the timestamps could duplicate) and convert it to a base 36 number. If the system being developed ran into a fatal issue it generated the base 36 number (7 chars) that was displayed to an end user via the web interface who could then relay the issue encountered (and the number) to a tech support person (who could then use it to find the point in the logs where the stacktrace started). A number like 56af42g7 is infinitely easier for a user to read and relay than a timestamp like 2016-01-21T15:34:29.933-08:00 or a random UUID like 5f0d3e0c-da96-11e5-b5d2-0a1d41d68578.

Warren Smith
  • 41
  • 1
  • 1
1

I really like the simplicity of just encoding a GUID using Base64 format and truncating the trailing == to get a string of 22 characters (it takes one line of code, and you can always convert it back to GUID). Sadly, it sometimes includes + and / characters. OK for database, not great for URLs, but it helped me appreciate the other answers :-)

From https://www.codeproject.com/Tips/1236704/Reducing-the-string-Length-of-a-Guid by Christiaan van Bergen

We found that converting the Guid (16 bytes) to an ASCII representation using Base64 resulted in a useable and still unique messageID of only 22 characters.

var newGuid = Guid.NewGuid();
var messageID = Convert.ToBase64String(newGuid.ToByteArray());

var message22chars = Convert.ToBase64String(Guid.NewGuid().ToByteArray()).Substring(0,22);

For example: The Guid 'e6248889-2a12-405a-b06d-9695b82c0a9c' (string length: 36) will get a Base64 representation: 'iYgk5hIqWkCwbZaVuCwKnA==' (string length: 24)

The Base64 representation ends with the '==' characters. You could just truncate these, without any impact on the uniqueness. Leaving you with an identifier of only 22 characters in length.

Ekus
  • 1,679
  • 21
  • 17
  • `Sadly, it sometimes includes + and / characters` - so use the Base64Url variant instead (RFC 4648). – Ian Goldby Apr 30 '21 at 09:38
  • @IanGoldby Base62 doesn't have that "problem". – Bouke Jun 13 '22 at 13:40
  • @Bouke You seem to miss my point. The answer above will occasionally generate IDs containing "/" and "+" (as already acknowledged). The simplest solution is to use the URL-safe base 64 variant - readily available in most libraries so it's no effort and there is literally no downside. Knowing this, why would anyone implement their own 62-character alphabet (which will generate slightly longer IDs too)? – Ian Goldby Jun 14 '22 at 07:26
  • 1
    @IanGoldby Indeed, if the only goal is to be URL-safe then RFC 4648 is perfect. Base64 is also a nicer encoding in that the encoding is simpeler and easier to reason about like length. What I don't like about Base64 is that +/ or -_ are word-boundaries and text selection is easier with Base62 at the cost of a more complex encoding scheme. – Bouke Jun 14 '22 at 09:28