3

I read several topics similar to what I'm asking, but none of them seemed to be very helpful to me.

I have a form where users can generate codes that are stored in a column with a Unique constraint. The codes are strings with length of 7 characters. The users can enter a number and the program generates that many codes, and this can be repeated until the maximum number of codes is reached.

My problem is with duplicate values. But not with values that are already present in the database in the moment of entering new entries(I check for those successfully), but some of the entries in the new group of (say 10000) codes are (probably) identical. So my code generates two(or more) identical codes in the same transaction and the Unique constraint in the DB complains about it.

I thought of checking the database after each entry, but it is extremely time consuming, considering we're talking about 10000 or sometimes more entries.

So now I think the only option is to modify the code that generates them in the first place, cause it seems to be inefficient and generate doubles.

A big part of the problem is the required length of the codes, otherwise I would go with pure 'uniqid()' or something similar, but since I have to restrict it to 7 characters I guess that makes it a lot worse. Also, I have to exclude some characters from the code[labeled 'problem_characters'] in the code.

Here's the code, I couldn't modify it properly to generate unique values only.

$problem_characters = array("0", "o", "O", "I", "1", 1);

$code = md5(uniqid(rand(), true));

$extId = strtoupper(str_replace($problem_characters,rand(2,9),substr($code, 0, 7)));

//insert $extId in the database

@Geo Ok, I tried your solution and it was working (of course), but then I got a new problem - in the 'else' part of your 'if' I'm doing the following:

$extId = strtoupper(str_replace($problem_characters,rand(2,9),substr($code, 0, 7)));

while(true){     

      if((!in_array($extId, $allExternalIdsHandled)) && (!in_array($extId, $newEnteredValues))){
       break;
        }else{
 $extId = strtoupper(str_replace($problem_characters,rand(2,9),substr($code, 0, 7)));   }
               }
//insert the modified value in the DB here

So, now it's entering an endless loop and it's not breaking out with the 'break' command even though it ought to be changed with the execution of the 'random' call and then enter the if and break out...

I do not see the problem here. Can someone give me some direction, please?

EDIT: It sometimes hangs, sometimes does not. I just entered 10000 values and got two entries modified via the 'else' path. I observed this using logs.

Bill the Lizard
  • 398,270
  • 210
  • 566
  • 880

3 Answers3

3

There are already libraries doing the hard work for you, allowing you to select the "alphabet" to use when generating the string and the length of the string.

Your "identical entries" problem is called a collision and it can't be avoided.

Edit So, similar to what was suggested by Geo, I'm using a PHP to create a list of n unique entries. The difference is that the SQL insert might fail, so I'm having 2 layers of iteration to make sure that we fill the total number desired:

<?php

require('hashids.php'); // I'm using the library I suggested

$hashids = new hashids('some salt', 7); // use the default alphabet, feel free to pass the 3rd parameter with the alphabet you want to use

$generationTries = 0;

$hashesInDBCount = 0; // get from your database
$desiredHashesCount = 50; // use a parameter
$totalDesiredHashes = $hashesInDBCount + $desiredHashesCount;
do
{
    // when coming back in the loop, only generate what's still required
    $desiredHashesCount = $totalDesiredHashes - $hashesInDBCount; 
    $generatedHashesCount = 0;
    $generatedHashes = array();

    while($generatedHashesCount < $desiredHashesCount)
    {
        $hash = $hashids->encrypt($generationTries++);
        if(!in_array($hash, $generatedHashes))
        {
            array_push($generatedHashes, $hash);
            ++$generatedHashesCount;
        }
    }

    // insert $generatedHashes in your Database

    $hashesInDBCount = 50; // again, query your database as you might come through this loop more than once, 
                           // I'm hardcoding the value to have a working example
}
while($hashesInDBCount < $totalDesiredHashes);

echo "Generated " . count($generatedHashes) . " hashes in " . $generationTries . " tries\n";
var_dump($generatedHashes);

Which gives me an aoutput like:

Generated 50 hashes in 50 tries
array(50) {
  [0]=>
  string(7) "eAcgAcx"
  [1]=>
  string(7) "Exidai8"
  [2]=>
  string(7) "ExTbqT8"
  [3]=>
  string(7) "4Acz8cB"
  [4]=>
  string(7) "LRipxir"
  [5]=>
  string(7) "zATe5Tx"
  ...
}

Adding a random salt will give you random values every time

emartel
  • 7,712
  • 1
  • 30
  • 58
  • No, the values are not important but the format is important(7 characters with the problematic characters left out). The code I provided is running inside a for() loop. I am terribly sorry, but I didn't fully understand your solution. Do you think I should create a while() loop inside the for() loop? Do you think I should set the target count to count+n outside the for() or inside? I don't really get the solution, sorry. Would you be kind to explain it furtherly? – thebloodycoon Dec 27 '12 at 08:50
  • Ok let me rephrase my answer, hopefully it will fit your requirements – emartel Dec 27 '12 at 21:16
  • I tried editing with my phone and had formatting issues with the code, I'll update when I get home, sorry for the delays – emartel Dec 27 '12 at 21:34
  • The only thing I didn't do was add your custom alphabet, but it should be easy enough to change! Let me know if you need me to do it – emartel Jan 12 '13 at 18:01
1
<?php

$problem_characters = array('0', 'o', 'O', 'I', '1', 1);
$length = 10000;
$i = 0;
$hashes = array();
while ($i < $length) {
    $code = md5(uniqid(rand(), TRUE));
    $extId = strtoupper(str_replace($problem_characters, rand(2, 9), substr($code, 0, 7)));
    if ( ! in_array($extId, $hashes)) {
        $hashes[] = $extId;
        $i++;
        // insert $extId in the database
    }
}
Geo
  • 12,666
  • 4
  • 40
  • 55
1

First - You are using md5 in order to generate the string code,as md5 is hex encoded string, you are severely reducing the number of possible combinations, generating random string with 30 possible chars gives you 21 bilion (10^9) possibilities rather than 268 million (10^6) with hex chars

Onther thing - you can never vreate really unique values (guid-s are machine-unique), the probability to generate the same value twice increases with shorter strings.

I can thing of three different approaches (I assume you have at least 30 non problematic chars) you can crate unique non random values. lets say tahat you have two counters generation request count and request counter. so if user 1 asks for 100 codes, codes like user_request_counter-code_counter : '00-00-00_00-00-00-01' to '00-00-01_00-00-03-00' are sure to be unique (and are actually 7 - each group of two digits up to 30 can be represented with single char(just like hex does with 16 chars - you can pick any base you like) this will allow you to create 30^4 (810,000) codes to 30^3 (27,000) users. This way you don't need to use expensive random calls and don't need to worry about duplicate codes. You can even make it a bit random by randomizing the coupon counter and assigning each code generation request random rather than counter but still the codes for every user will have the same prefix

The second approach that I used once it to simply fill the DB with random codes and then just assign them to the users, this is useful as you only need to do it once in a while, and the new code generation can be done done offline (with dump of the db) and then pushed to the server, this way you get awesome performance o(1) on code generation on the php code, o(1) on the db server side as there is no need to update the table index thousands of times as you would do when you insert the php generated codes to the database.

The third approach in case your only problem is that you get duplicates in the php generated values is to put them in array and then check if the values is new. as php arrays are implemented as hashtables you will get pretty good performance).

If you choose to generate the codes in PHP randomly - you will always have to face two problems - the first - there is no guarantee that the code won't exist in the database so you will always need to handle double keys problems and the second is that as you need to generate many codes - inserting them in to db will be pretty expensive on the sql server side and as you have more codes - will slow the script significantly

SimSimY
  • 3,616
  • 2
  • 30
  • 35
  • Ok, about the md5() part of your answer - do you think I should just remove it, like this - `$code = uniqid(rand(), true);` or do you think I should do it like this - `$code = str_shuffle("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")`? – thebloodycoon Dec 27 '12 at 09:10
  • check the second answer here http://stackoverflow.com/questions/853813/how-to-create-a-random-string-using-php `function randString($length, $charset='ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789') { $str = ''; $count = strlen($charset); while ($length--) { $str .= $charset[mt_rand(0, $count-1)]; } return $str; }` – SimSimY Dec 27 '12 at 09:56