1

I have a column that I would like to default to a randomly generated int8 in a specified range. I would also like this column to be unique, so if a random value is generated that already exists, it should be rerolled.

So my question is what the most idiomatic way to do the above is in PostgreSQL, ideally with good performance and supporting bulk inserts.

For example if I had a Person table with columns name and id, and I want id to be a random unique int8 in the range (0, 999999). I would want to be able to insert Paul, Kelly, David and Katie and get something like the following:

| Name  |   id   |
+-------+--------+
| Paul  | 314563 |
| Kelly | 592103 |
| David | 127318 |
| Katie | 893134 |

With no risk of duplicates and no risk of an insertion failure.

The range is not going to be large enough for me to safely assume they will never collide (i.e Birthday Paradox).

I should also say I do want true unpredictable randomness, so a cipher on a sequence would not count.

There are a variety of answers on how to generate random numbers, so the main focus of the question is the uniqueness aspect.

With that said a clean and efficient way to generate an int8 uniformly in an arbitrarily large range would be appreciated. random() * n starts having gaps when n > 2 ^ 53 (perhaps earlier).

semicolon
  • 2,530
  • 27
  • 37
  • Please show sample data. What do you *mean* by a unique randomly generated number? – Gordon Linoff Dec 19 '19 at 12:10
  • Edited the question. I just mean an `int8` approximately uniformly generated in a range that I would be able to specify. – semicolon Dec 19 '19 at 12:16
  • Thats called data encoding – Himanshu Dec 19 '19 at 12:18
  • @HimanshuAhuja What do you mean? – semicolon Dec 19 '19 at 12:19
  • Does this answer your question? [How do you create a random string that's suitable for a session ID in PostgreSQL?](https://stackoverflow.com/questions/3970795/how-do-you-create-a-random-string-thats-suitable-for-a-session-id-in-postgresql) – Himanshu Dec 19 '19 at 12:22
  • @HimanshuAhuja No it does not. That does not specify how to handle collisions, and since I will be restricting the integers to within a more reasonable range, collisions will eventually happen. – semicolon Dec 19 '19 at 12:26
  • What if you have more data and all the randoms have been generated within the range already also if you have duplicates or redundancy? – Himanshu Dec 19 '19 at 12:27
  • @HimanshuAhuja The range will be large enough for that to take quite a long time and give me plenty of time to adjust the range. It's just that due to the birthday paradox even a range of 1 trillion has a decent chance of causing a collision only 1 million or so keys in. – semicolon Dec 19 '19 at 12:31
  • The difficulty won't be to generate a random 6 digit number. You'll find solutions for that. F.e. [here](https://stackoverflow.com/a/41988979/4003419) (with a bit of tweaking for only digits). The challenge will be to assure uniqueness. Which I don't expect is doable for a column default. – LukStorms Dec 19 '19 at 12:36
  • @LukStorms Yeah agreed, edited question to emphasize that part. I was hoping there was an easy way for the `default` value to "interact with" the B-tree used by the column in an atomic way. – semicolon Dec 19 '19 at 12:44
  • Btw, why would you want to do this? It's just that I've noticed more questions regarding data scramble/masking because of [GDPR](https://en.m.wikipedia.org/wiki/General_Data_Protection_Regulation). But to see one asked for random id's is rather exceptional. – LukStorms Dec 19 '19 at 12:55
  • 1
    @LukStorms The keys will be used in URLs and internally in client code, so I would like to not reveal information such as creation order or row creation rate/count. – semicolon Dec 19 '19 at 12:56
  • Just use a `uuid` and be done with it. –  Dec 19 '19 at 14:07
  • I guess this whole problem is unnecessary why not have simply autoincrement this is just an overexaggerated made up problem – Himanshu Dec 19 '19 at 15:22
  • @a_horse_with_no_name They would be far too large to store reasonably in a URL or for a client to read over the phone. – semicolon Dec 19 '19 at 22:44
  • @HimanshuAhuja These are the specific business requirements, I can't just disregard them lol. We do not want the number of rows created or the order of row creation to be calculable. – semicolon Dec 19 '19 at 22:45
  • Related question, for context: https://stackoverflow.com/questions/59410237/key-generation-that-is-random-unique-db-wide-and-bounded – Peter O. Dec 20 '19 at 15:49

1 Answers1

0

A possible solution :

create table t (name varchar (50), id int);

-- 1. generate a list of possible ids
-- 2. cast the id in varchar to make a string after that
-- 3. aggregate all the possible ids in a string with a ',' separator
-- 4. make the string a list
-- 5. select a random value in this list
-- 6. insert the new id for the wanted name. Here 'test'
with cte as 
(
  SELECT a.n as possible_id
  from generate_series(1, 150000) as a(n)
  where not exists (select 1 from t where t.id = a.n)
)
, cte_s as 
(
  select 
    (
        string_to_array( 
            string_agg( 
                cast(possible_id as varchar)
                , ','
            )
            , ','
        )
    )[floor(random() * 150000 + 1)] as new_id
  from cte
)
insert into t
values ('test', (select new_id from cte_s)::int); 

-- test that your code doing what you want
select *
from t;

http://sqlfiddle.com/#!17/17d42/26

You can modify the maximum amount as you want of course.

Jaisus
  • 1,019
  • 5
  • 14
  • Would you mind going through the code a little? I'm not super sure what all the string stuff is for, it seems like I would want to stick to `int8` besides the `'test'` name. – semicolon Dec 27 '19 at 00:13
  • I edited the answer as requested. You have the sqlfiddle link to test it but do not hesitate to ask if you have any other question. – Jaisus Jan 07 '20 at 08:42