31

I've seen a bunch of different solutions on StackOverflow that span many years and many Postgres versions, but with some of the newer features like gen_random_bytes I want to ask again to see if there is a simpler solution in newer versions.

Given IDs which contain a-zA-Z0-9, and vary in size depending on where they're used, like...

bTFTxFDPPq
tcgHAdW3BD
IIo11r9J0D
FUW5I8iCiS

uXolWvg49Co5EfCo
LOscuAZu37yV84Sa
YyrbwLTRDb01TmyE
HoQk3a6atGWRMCSA

HwHSZgGRStDMwnNXHk3FmLDEbWAHE1Q9
qgpDcrNSMg87ngwcXTaZ9iImoUmXhSAv
RVZjqdKvtoafLi1O5HlvlpJoKzGeKJYS
3Rls4DjWxJaLfIJyXIEpcjWuh51aHHtK

(Like the IDs that Stripe uses.)

How can you generate them randomly and safely (as far as reducing collisions and reducing predictability goes) with an easy way to specify different lengths for different use cases, in Postgres 9.6+?

I'm thinking that ideally the solution has a signature similar to:

generate_uid(size integer) returns text

Where size is customizable depending on your own tradeoffs for lowering the chance of collisions vs. reducing the string size for usability.

From what I can tell, it must use gen_random_bytes() instead of random() for true randomness, to reduce the chance that they can be guessed.

Thanks!


I know there's gen_random_uuid() for UUIDs, but I don't want to use them in this case. I'm looking for something that gives me IDs similar to what Stripe (or others) use, that look like: "id": "ch_19iRv22eZvKYlo2CAxkjuHxZ" that are as short as possible while still containing only alphanumeric characters.

This requirement is also why encode(gen_random_bytes(), 'hex') isn't quite right for this case, since it reduces the character set and thus forces me to increase the length of the strings to avoid collisions.

I'm currently doing this in the application layer, but I'm looking to move it into the database layer to reduce interdependencies. Here's what the Node.js code for doing it in the application layer might look like:

var crypto = require('crypto');
var set = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

function generate(length) {
  var bytes = crypto.randomBytes(length);
  var chars = [];

  for (var i = 0; i < bytes.length; i++) {
    chars.push(set[bytes[i] % set.length]);
  }

  return chars.join('');
}
Community
  • 1
  • 1
Ian Storm Taylor
  • 8,520
  • 12
  • 55
  • 72
  • What's the range of `N`? – Bohemian Feb 01 '17 at 06:49
  • http://stackoverflow.com/q/40006558/330315 or http://stackoverflow.com/q/19530736/330315 –  Feb 01 '17 at 07:31
  • @IanStorm. I answered this question because I see it a lot. But, Actually I'm of the mindset that it shouldn't be here with the term "unique identifier". If you want gibberish you can have it, by all means. But identiifer and not UUID is pretty silly, imho. That's what it's for. – Evan Carroll Feb 01 '17 at 07:35
  • As a useful suggestion, I would suggest you make this question stand out by removing all mentions of "unique identifier" and replacing them with "set that merely looks unique in appearance." – Evan Carroll Feb 01 '17 at 08:04
  • 1
    Thanks @EvanCarroll! I use the term "identifier" because that's my use case, but more importantly because I think it connotes the security necessary—the result should not be predictable, similar to how using `SERIAL` would not work for this scenario. I get that UUIDs are made for this, but I'd like a bit more control over the output length and "look" as far as characters used go—similar to how Youtube or others do for short URL codes. – Ian Storm Taylor Feb 01 '17 at 17:17
  • That's not how YouTube works. They use HashIDS which return 1:1 with serial. You can do that too in PostgreSQL. What you're doing is going to sting you. – Evan Carroll Feb 01 '17 at 17:36
  • Okay fine, forget Youtube. Take Stripe as an example instead, whose IDs look like `"id": "ch_19iRv22eZvKYlo2CAxkjuHxZ"`. I only use "shortcodes" as context to convey what kind of aesthetic I'm going for. – Ian Storm Taylor Feb 01 '17 at 17:44
  • @EvanCarroll Could you provide an example of how Ian's wanted approach could sting? – kevlarr Jun 15 '18 at 04:05
  • 1
    @kevlarr If `62**10` is ever not enough entropy. That's what Ian is doing. He's storing 10 bytes in 14 bytes of storage for `62**10` bits of entropy. When he could have `2**128 bits` in 16 bytes (substantially less chance of a collision, as a standard it's how you do this), or he could use salted hashcats which has a 0-chance of collision and returns a smaller key – Evan Carroll Jun 15 '18 at 05:01

6 Answers6

27

Figured this out, here's a function that does it:

CREATE OR REPLACE FUNCTION generate_uid(size INT) RETURNS TEXT AS $$
DECLARE
  characters TEXT := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
  bytes BYTEA := gen_random_bytes(size);
  l INT := length(characters);
  i INT := 0;
  output TEXT := '';
BEGIN
  WHILE i < size LOOP
    output := output || substr(characters, get_byte(bytes, i) % l + 1, 1);
    i := i + 1;
  END LOOP;
  RETURN output;
END;
$$ LANGUAGE plpgsql VOLATILE;

And then to run it simply do:

generate_uid(10)
-- '3Rls4DjWxJ'

Warning

When doing this you need to be sure that the length of the IDs you are creating is sufficient to avoid collisions over time as the number of objects you've created grows, which can be counter-intuitive because of the Birthday Paradox. So you will likely want a length greater (or much greater) than 10 for any reasonably commonly created object, I just used 10 as a simple example.


Usage

With the function defined, you can use it in a table definition, like so:

CREATE TABLE users (
  id TEXT PRIMARY KEY DEFAULT generate_uid(10),
  name TEXT NOT NULL,
  ...
);

And then when inserting data, like so:

INSERT INTO users (name) VALUES ('ian');
INSERT INTO users (name) VALUES ('victor');
SELECT * FROM users;

It will automatically generate the id values:

    id     |  name  | ...
-----------+--------+-----
owmCAx552Q | ian    |
ZIofD6l3X9 | victor |

Usage with a Prefix

Or maybe you want to add a prefix for convenience when looking at a single ID in the logs or in your debugger (similar to how Stripe does it), like so:

CREATE TABLE users (
  id TEXT PRIMARY KEY DEFAULT ('user_' || generate_uid(10)),
  name TEXT NOT NULL,
  ...
);

INSERT INTO users (name) VALUES ('ian');
INSERT INTO users (name) VALUES ('victor');
SELECT * FROM users;

      id       |  name  | ...
---------------+--------+-----
user_wABNZRD5Zk | ian    |
user_ISzGcTVj8f | victor |
Ian Storm Taylor
  • 8,520
  • 12
  • 55
  • 72
  • This is great, thank you Ian! — Have you used random strings as primary keys without issue? Or are there other gotchas to watch out for? – vpontis May 05 '20 at 00:16
  • How to make sure `gen_random_bytes(size);` is unique; – hjl May 10 '20 at 18:21
  • If you use this to define a default value, are concurrent inserts safe? – kcstricks Sep 15 '20 at 04:00
  • @kcstricks yes, the variables are local to each function invocation – jbg Jun 24 '21 at 07:00
  • @hjl if `size` is small and you're worried about the chance of collision, make sure your column is `PRIMARY KEY` or `UNIQUE` and then you can retry (thus generating a new ID) if you get a duplicate key error. if `size` is large enough it's not going to happen in your (or your program's, or the universe's) lifetime though. – jbg Jun 24 '21 at 07:01
  • 1
    This solution isn't as secure as it appears because it has a [modulo bias](https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/). – Roy Paterson Sep 01 '21 at 13:25
8

I'm looking for something that gives me "shortcodes" (similar to what Youtube uses for video IDs) that are as short as possible while still containing only alphanumeric characters.

This is a fundamentally different question from what you first asked. What you want here then is to put a serial type on the table, and to use hashids.org code for PostgreSQL.

  • This returns 1:1 with the unique number (serial)
  • Never repeats or has a chance of collision.
  • Also base62 [a-zA-Z0-9]

Code looks like this,

SELECT id, hash_encode(foo.id)
FROM foo; -- Result: jNl for 1001

SELECT hash_decode('jNl') -- returns 1001

This module also supports salts.

Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
5

Review,

  1. 26 characters in [a-z]
  2. 26 characters in [A-Z]
  3. 10 characters in [0-9]
  4. 62 characters in [a-zA-Z0-9] (base62)
  5. The function substring(string [from int] [for int]) looks useful.

So it looks something like this. First we demonstrate that we can take the random-range and pull from it.

SELECT substring(
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
  1, -- 1 is 'a', 62 is '9'
  1,
);

Now we need a range between 1 and 63

SELECT trunc(random()*62+1)::int+1
FROM generate_series(1,1e2) AS gs(x)

This gets us there.. Now we just have to join the two..

SELECT substring(
  'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
  trunc(random()*62)::int+1
  1
)
FROM generate_series(1,1e2) AS gs(x);

Then we wrap it in an ARRAY constructor (because this is fast)

SELECT ARRAY(
  SELECT substring(
    'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
    trunc(random()*62)::int+1,
    1
  )
  FROM generate_series(1,1e2) AS gs(x)
);

And, we call array_to_string() to get a text.

SELECT array_to_string(
  ARRAY(
      SELECT substring(
        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
        trunc(random()*62)::int+1,
        1
      )
      FROM generate_series(1,1e2) AS gs(x)
  )
  , ''
);

From here we can even turn it into a function..

CREATE FUNCTION random_string(randomLength int)
RETURNS text AS $$
SELECT array_to_string(
  ARRAY(
      SELECT substring(
        'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
        trunc(random()*62)::int+1,
        1
      )
      FROM generate_series(1,randomLength) AS gs(x)
  )
  , ''
)
$$ LANGUAGE SQL
RETURNS NULL ON NULL INPUT
VOLATILE LEAKPROOF;

and then

SELECT * FROM random_string(10);
Community
  • 1
  • 1
Evan Carroll
  • 78,363
  • 46
  • 261
  • 468
  • Hey Evan, is that "safe" to use as far as collision avoidance and predictability avoidance go seeing as it's using `random()` instead of something like `gen_random_bytes()`? (I realize that collision avoidance is a factor of length.) Looking at the PG docs it says: `The characteristics of the values returned by random() depend on the system implementation. It is not suitable for cryptographic applications; see pgcrypto module for an alternative.` – Ian Storm Taylor Feb 01 '17 at 17:10
  • No, it absolutely is not safe to use in any way shape or form which is why I'm for closing this question. If you want something that's safe to use, use UUID. If you want to play around with something likely to burn you *severely* and leave you crying. A solution which requires you create your own function that is worse in every way than the stock feature set to do this, then have it at. =) – Evan Carroll Feb 01 '17 at 17:17
  • Seriously, UUID is great. It's written in C, it's faster to run, it stores more efficiently (less space), and it has far more random. You have `64**length` bits of random here, in UUID you have `2**128` bits of random. Your string would have to be larger 22 characters or greater to store more random than UUID, and at that point it's already 10 bytes larger (40%) less efficient. – Evan Carroll Feb 01 '17 at 17:23
  • 1
    `trunc(random()*62+1)::int + 1` will never return a 1 (i.e. `'a'`). The correct expression for the range `[1-62]` is `(random() * 61)::int + 1` which also saves a function call. If you must use trunc, `trunc(random()*62)::int + 1` is ok. With `round`, the number must change: `round(random()*61)::int + 1` – Amit Naidu May 21 '19 at 21:07
3

Thanks to Evan Carroll answer, I took a look on hashids.org. For Postgres you have to compile the extension or run some TSQL functions. But for my needs, I created something simpler based on hashids ideas (short, unguessable, unique, custom alphabet, avoid curse words).

Shuffle alphabet:

CREATE OR REPLACE FUNCTION consistent_shuffle(alphabet TEXT, salt TEXT) RETURNS TEXT AS $$
DECLARE
    SALT_LENGTH INT := length(salt);
    integer INT = 0;
    temp TEXT = '';
    j INT = 0;
    v INT := 0;
    p INT := 0;
    i INT := length(alphabet) - 1;
    output TEXT := alphabet;
BEGIN
    IF salt IS NULL OR length(LTRIM(RTRIM(salt))) = 0 THEN
        RETURN alphabet;
    END IF;
    WHILE i > 0 LOOP
        v := v % SALT_LENGTH;
        integer := ASCII(substr(salt, v + 1, 1));
        p := p + integer;
        j := (integer + v + p) % i;

        temp := substr(output, j + 1, 1);
        output := substr(output, 1, j) || substr(output, i + 1, 1) || substr(output, j + 2);
        output := substr(output, 1, i) || temp || substr(output, i + 2);

        i := i - 1;
        v := v + 1;
    END LOOP;
    RETURN output;
END;
$$ LANGUAGE plpgsql VOLATILE;

The main function:

CREATE OR REPLACE FUNCTION generate_uid(id INT, min_length INT, salt TEXT) RETURNS TEXT AS $$
DECLARE
    clean_alphabet TEXT := 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890';
    curse_chars TEXT := 'csfhuit';
    curse TEXT := curse_chars || UPPER(curse_chars);
    alphabet TEXT := regexp_replace(clean_alphabet, '[' || curse  || ']', '', 'gi');
    shuffle_alphabet TEXT := consistent_shuffle(alphabet, salt);
    char_length INT := length(alphabet);
    output TEXT := '';
BEGIN
    WHILE id != 0 LOOP
        output := output || substr(shuffle_alphabet, (id % char_length) + 1, 1);
        id := trunc(id / char_length);
    END LOOP;
    curse := consistent_shuffle(curse, output || salt);
    output := RPAD(output, min_length, curse);
    RETURN output;
END;
$$ LANGUAGE plpgsql VOLATILE;

How-to use examples:

-- 3: min-length
select generate_uid(123, 3, 'salt'); -- output: "0mH"

-- or as default value in a table
CREATE SEQUENCE IF NOT EXISTS my_id_serial START 1;
CREATE TABLE collections (
    id TEXT PRIMARY KEY DEFAULT generate_uid(CAST (nextval('my_id_serial') AS INTEGER), 3, 'salt')
);
insert into collections DEFAULT VALUES ;
Dantio
  • 1,799
  • 1
  • 10
  • 13
1

This query generate required string. Just change second parasmeter of generate_series to choose length of random string.

SELECT
     string_agg(c, '')
FROM (
     SELECT
          chr(r + CASE WHEN r > 25 + 9 THEN 97 - 26 - 9 WHEN r > 9 THEN 64 - 9 ELSE 48 END) AS c
     FROM (
           SELECT
                 i,
                 (random() * 60)::int AS r
           FROM
                 generate_series(0, 62) AS i
          ) AS a
      ORDER BY i
     ) AS A;
Roman Tkachuk
  • 3,096
  • 1
  • 16
  • 15
1

So I had my own use-case for something like this. I am not proposing a solution to the top question, but if you are looking for something similar like I am, then try this out.

My use-case was that I needed to create a random external UUID (as a primary key) with as few characters as possible. Thankfully, the scenario did not have a requirement that a large amount of these would ever be needed (probably in the thousands only). Therefore a simple solution was a combination of using generate_uid() and checking to make sure that the next sequence was not already used.

Here is how I put it together:

CREATE OR REPLACE FUNCTION generate_id (
    in length INT
,   in for_table text
,   in for_column text
,   OUT next_id TEXT
) AS
$$
DECLARE
    id_is_used BOOLEAN;
    loop_count INT := 0;
    characters TEXT := 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
    loop_length INT;
BEGIN

LOOP
    next_id := '';
    loop_length := 0;
    WHILE loop_length < length LOOP
    next_id := next_id || substr(characters, get_byte(gen_random_bytes(length), loop_length) % length(characters) + 1, 1);
    loop_length := loop_length + 1;
    END LOOP;

    EXECUTE format('SELECT TRUE FROM %s WHERE %s = %s LIMIT 1', for_table, for_column, quote_literal(next_id)) into id_is_used;

    EXIT WHEN id_is_used IS NULL;

    loop_count := loop_count + 1;

    IF loop_count > 100 THEN
        RAISE EXCEPTION 'Too many loops. Might be reaching the practical limit for the given length.';
    END IF;
END LOOP;


END
$$
LANGUAGE plpgsql
STABLE
;

here is an example table usage:

create table some_table (
    id
        TEXT
        DEFAULT generate_id(6, 'some_table', 'id')
        PRIMARY KEY
)
;

and a test to see how it breaks:

DO
$$
DECLARE
    loop_count INT := 0;

BEGIN

-- WHILE LOOP
WHILE loop_count < 1000000
LOOP

    INSERT INTO some_table VALUES (DEFAULT);
    loop_count := loop_count + 1;
END LOOP;

END
$$ LANGUAGE plpgsql
;
Alexi Theodore
  • 1,177
  • 10
  • 16