Efficient method to find collision free random numbers

Question

I have a users table, the user ID is public. But I want to obfuscate the number of registered user and trends of the project, so I don't want to have public incrementing IDs.

When a new user is created I want to find a random integer number that is greater than a certain number and that is not yet in the database.

Naive code:

<?php
    $found = false;
    while(!$found) {
      $uid = rand(1000000000,4294967295) // find random number betwen minimum and maximum
      $dbh->beginTransaction();
      // check if user id is in use, and if not insert it
      if($dbh->query("SELECT * FROM users WHERE uid = $uid")) {
        $dbh->exec("INSERT INTO users (uid) VALUES ($uid)");
        $found = true;
      }
      $dbh->commit();
    }
    // we just got our new uid ...
?>

This will work it however may become inefficient. True that there is a big range and the probability of hitting an unused uid is high. But what if I want to use a smaller range, because I don't want to have so long userids?

Example of my concerns:

60% of all user ids are in use
the chance of hitting an unused uid are 0.4
the first attempt has 0.4% success rate
if 1st not successful the second attempt has 0.6*0.4 probability
so with a maximum of two tries i have 0.4 + 0.6*0.4 proability (is that right??)

So one method to optimize is that came to my mind is the following:

find a random number, check if its free, if not, increment it by 1 and try again and so on
if the maximum number is hit, continue with the minimum number

That should give me a number with a maximum runtime of O(range)

That sounds pretty bad but I think it is not, because I submit random numbers to the database and that they are all at the beginnig is very unlikely. So how good/bad is it really?

I think this would work just fine but I want it BETTER

So what about this?

find a random number
query the database for how many numbers are occupied in the range whole range, starting from that number (this first step is trivial...)
if there are numbers occupied in that range, divide the range by half and try again. starting with the initial number
if there are numbers occupied divide the range by half and try again. starting with the initial number

If I am thinking correctly this will give ma a number with a maximum of O(log(range)) time.

That is pretty satisfying because log() is pretty good. However I think this method will often be as bad as possible. Because with our random numbers we will probably always hit numbers in the large intervals.

So at the beginning our pure random method is probably better.

So what about having a limit like this

select current number of used numbers
is it greater than X, logarithmic range approach
if it is not, use pure random method

What would X be and why?

So final question:

This is pretty easy and pretty complicated at the same time.

I think this is a standard problem because lots and lots of system use random ids (support tickets etc), so I cannot imagine I am the first one to stumble across this.

How would you solve this? Any input is appriciated!

Is there maby an existing class / procedure for this I can use?

Or maby some database functions that I can use?

I would like to do it in PHP/Mysql

IMPORTANT EDIT:

I just thought about the range/logarithmic solution. It seems to be complete bullshit sorry for my wording because:

what if i hit an occupied number at start?

Then I am dividing my range so long if it is only 1. And even then the number is occoupied.

So its completely the same as the pure random method from start, only worse....

I am a bit embarassed I made this up but I will leave it in because I think its a good example of overcomplicated thinknig!

My advice that doesn't answer the question necessarily is don't make the user id public if it's being used a key. Why do users need to know this? Why can't they just know username if they need to identify a user. You are opening up an implementation detail which is bad architecturally and also a potential security risk. — James Gaunt, Jul 24 '11 at 17:53
You can assume for the problem that the public exporuse of the ID is no security risk. The reason behind this is that there is no name. Or better said, I want it to be changeable by the user, but keep a unique identifier. The user is also not required to choose a username, and I don't want to display him with a huge random GUID. FYI: Stackoverflow does knid of the same. Try opening a new account with no username, there will be somethin like user12345 assigend as your nickname. Apart form that, I am interested in the problem itself, but thanks for your input. That was important 2b said. +1 — The Surrican, Jul 24 '11 at 17:56
Why not do something similar to stackoverflow and assign some sort of username in the same format to them then if that is the functionality that you are looking for? — Mike, Jul 24 '11 at 18:08
BTW, if it's 0.4 chance of success each time, then for n attempts the chance of success is (1 - 0.6^n) which gets close to 1 pretty damn quick. — James Gaunt, Jul 24 '11 at 18:10
I would probably use UUID v4 in your situation. See http://stackoverflow.com/questions/2040240/php-function-to-generate-v4-uuid — Finbarr, Jul 24 '11 at 18:11
Possible duplicate of [Pseudo-random-looking one-to-one int32->int32 function](https://stackoverflow.com/questions/15533937/pseudo-random-looking-one-to-one-int32-int32-function) — Peter O., Jul 30 '17 at 23:45

score 13 · Accepted Answer · answered Jul 24 '11 at 18:11

If p is the proportion of ids in use, your "naive" solution will, on average, require 1/(1-p) attempts to find an unused id. (See Exponential distribution). In the case of 60% occupancy, that is a mere 1/0.4 = 2.5 queries ...

Your "improved" solution requires about log(n) database calls, where n is the number of ids in use. That is quite a bit more than the "naive" solution. Also, your improved solution is incomplete (for instance, it does not handle the case where all number in a subrange are taken, and does not elaborate with subrange you recurse into) and is more complex to implement to boot.

Finally, note that your implementation will only be thread safe if the database provides very strict transaction isolation, which scales poorly, and might not be the default behaviour of your database system. If that turns out to be a problem, you could speculatively insert with a random id, and retry in the event of a constraint violation.

But I always have to keep an eye on that the range does not get filled up? Because if 99% are in use it would take on average 100 queries - correct? — The Surrican, Jul 24 '11 at 18:26
If 99% of your user id space is in use, you (will soon) have bigger problems than slow performance. — Ilmari Karonen, Jul 24 '11 at 22:07

score 3 · Answer 2 · answered Jul 24 '11 at 18:12

3

How about you compose the number from something that won't clash and a random number in some small range.

 ddddd-(rrr+n)

The ddddd being for example the number of days your system has been live, the rrr is a random number chosen each day, the n is the increment within day.

Given any one number a person not knowing rrr for the day cannot deduce how many users have been created on a given day.

answered Jul 24 '11 at 18:12

djna

54,992
14
74
117

that, sir, is a pretty awesome idea. – The Surrican Jul 24 '11 at 18:15

score 2 · Answer 3 · answered Jul 24 '11 at 18:07

2

If you don't want to test for used numbers, you can create a function that calculates a random looking id $id_k based on the automatically incrementing id from the database $id:

$id_k = transpose($id);

That function has either a reverse cousin or is able to transpose transparently back (ideally):

$id = transpose($id_k);

You can then used the transposed IDs on your site.

Another idea that comes into my mind is that you precalculate the random id's per incrementing ids so you can control the use of the database better.

answered Jul 24 '11 at 18:07

hakre

193,403
52
435
836

the problem is that this 'security' is based on the knowledge of function transpose(). Once somebody knows the algorithm, he can easily guess the ids. – Tomas Jul 24 '11 at 18:11
Sure, OP has asked for obfuscation only specifically. Even with "true" random keys in your database, someone who knows the database data, knows the ids as well. – hakre Jul 24 '11 at 18:12
Someone who knows database data knows everything :-) This is complete different security break than knowning the algorithm. – Tomas Jul 24 '11 at 20:24
2

Actually, this _can_ work, if `transpose` is good enough... but good enough here basically means a cryptographically secure (adjustable size) block cipher with a secret key, which is probably a rather heavy tool to use for a problem like this. – Ilmari Karonen Jul 24 '11 at 22:29

Tomas · Answer 4 · 2011-07-24T18:19:59.353

2

Joe, just implement your algorithm as above with no worry. Just look: if the probability of hitting used id is p = 0.6, then the probability that you hit used id N consecutive times is p^N. This goes down exponentially! I'd recommend to set the ID density lower, e.g. to p = 0.1. Then the probability that you don't suceed for 10 consecutive attempts is p^10 = 0.1^10 = 1e-10!!! Absolutely negligible.

Don't worry of the collisions and go for your solution.

edited Jul 24 '11 at 18:19

answered Jul 24 '11 at 18:09

Tomas

57,621
49
238
373

I like this solution the best. If Joe can't let go of the worry of collisions, then Joe could precompute a long list of unique random numbers and hand them out as needed. – emory Jul 24 '11 at 20:41

score 1 · Answer 5 · answered Jul 24 '11 at 18:05

How about when you start the application, pick a random range of 100 numbers (e.g. 100 - 199, 1000 - 1099, 5400 - 5499), check the first one, if it's not in the database we know (based on this algorithm) that all 100 are free. Store the start of this range in memory.

Then just allocate these until you run out (or your application recycles) and then pick another random range. So you only need to go to the database once every 100 users.

This is similar to the Nhibernate hi/lo approach (except for the random bit).

Obviously tweak to 100 depending on the rate at which you allocate ids compared to the typical life span of the application in memory.

But this way once you see one id the other ids could be easily guessed... moreover, the i/o issues are low level, let's focus on the algorithm. BTW, AFAIK in PHP the lifetime of the 'app' is one http query. — Tomas, Jul 24 '11 at 18:18

score 1 · Answer 6 · answered Jul 24 '11 at 18:05

1

You could simply use any hash shuffle algorithm to generate new Id value by the known number of users (holding of this value is a usual practice). This approach could be better than your current solution because the corresponding algorithm will likely generate less collisions. The key point is to choose algorithm with appropriate strength and distribution uniformity.

answered Jul 24 '11 at 18:05

eugene_che

1,997
12
12

I don't quite understand by what you mean with 'You could simply use any hash shuffle algorithm to generate new Id value by the known number of users (holding of this value is a usual practice)' - could you explain it further please? – Tomas Jul 24 '11 at 18:13
@Tomas, I mean that [hash](http://en.wikipedia.org/wiki/Hash_function) function could be more efficient for this purpose than the random one. For example, you can get ids with the following formula `id = (user_num*777+ (user_num>>4)*3) mod 111111`. This one is pretty simple, but it could be more complicated. – eugene_che Jul 24 '11 at 18:23
OK, normal hash. But then I don't see how is that more effective than the proposed solution. You have to check for collisions also and moreover, you have to store this new id also. I think it is much easier to generate a random number with much more assured result regarding the distribution uniformity. – Tomas Jul 24 '11 at 20:34
@Tomas. Lets assume that you want to generate ids in range from 10 to 20 for 10 users. Hash function for id generation `10 + user_num*7 mod 10` will never lead to collision. If you use the random function with the same input load the probability of collision on the last step is 0.9. Of course this is a toy example, but it illustrates the idea. – eugene_che Jul 24 '11 at 21:08

score 1 · Answer 7 · edited May 23 '17 at 11:50

To expand on on meriton's and Tomas Telensky's answers, if you want to keep your user IDs short while ensuring that you don't run out of them, you could select each ID randomly from, say, the range 1 to 10*n+1000, where n is the current number of users you have.

That way, your effective user ID space will never be more than 10% full, while the user IDs will (in the long run) be only about one digit longer than if you'd assigned them sequentially. The down side, of course, is that the IDs will no longer be completely uncorrelated with registration order: if someone has the ID 5851, you know they must be at least the 486th registered user and that they're unlikely to be, say, the 50000th. (Of course, you introduce the same kind of correlations if you ever manually adjust the range to accommodate more users.)

Of course you can adjust the constants 10 and 1000 above: the bigger they are, the longer and more random your user IDs will be.

Craig McQueen · Answer 8 · 2011-08-02T02:45:56.570

You could use symmetric-key encryption (e.g. AES) to encrypt a counter. If you use the entire output (128-bits for AES) then you're guaranteed no collisions, and it's a reversible mapping.

128 bits might be more than you want to deal with though—it's a 32-digit hex number, 39-digit decimal number. You could use a 64-bit encryption algorithm such as DES, Blowfish or Misty (16-digit hex number, 20-digit decimal number).

Efficient method to find collision free random numbers

8 Answers8