1

I need to generate close to a million(100 batches of 10000 numbers) unique and random 12 digit codes for a scratch card application. This process will be repeated and will need an equal number of codes to be generated everytime.

Also the generated codes need to be entered in a db so that they can be verified later when a consumer enters this on my website. I am using PHP and Mysql to do this. These are the steps I am following

  1. Get admin input on the number of batches and the codes per batch

  2. Using for loop generate the code using mt_rand(100000000000,999999999999)

  3. Check every time a number is generated to see if a duplicate exists in the db and if not add to results variable else regenerate.

  4. Save generated number in db if unique

  5. Repeat b,c, and d over required number of codes

  6. Output codes to admin in a csv

Code used(removed most of the comments to make it less verbose and because I have already explained the steps earlier):

$totalLabels = $numBatch*$numLabelsPerBatch;
// file name for download
$fileName = $customerName."_scratchcodes_" . date('Ymdhs') . ".csv";
$flag = false;
$generatedCodeInfo = array();
// headers for download
header("Content-Disposition: attachment; filename=\"$fileName\"");
header("Content-Type: application/vnd.ms-excel");
$codeObject = new Codes();
//get new batch number 
$batchNumber = $codeObject->getLastBatchNumber() + 1;
$random = array();
for ($i = 0; $i < $totalLabels; $i++) {
    do{
        $random[$i] = mt_rand(100000000000,999999999999); //need to optimize this to reduce collisions given the databse will be grow
    }while(isCodeNotUnique($random[$i],$db));
    $codeObject = new Codes();
    $codeObject->UID = $random[$i];
    $codeObject->customerName = $customerName;
    $codeObject->batchNumber = $batchNumber;
    $generatedCodeInfo[$i] = $codeObject->addCode();

    //change batch number for next batch
    if($i == ($numLabelsPerBatch-1)){$batchNumber++;}


    //$generatedCodeInfo[i] = array("UID" => 10001,"OID"=>$random[$i]);
    if(!$flag) {
        // display column names as first row
        echo implode("\t", array_keys($generatedCodeInfo[$i])) . "\n";
        $flag = true;
    }
    // filter data
    array_walk($generatedCodeInfo[$i], 'filterData');
    echo implode("\t", array_values($generatedCodeInfo[$i])) . "\n";


}


function filterData(&$str)
{
    $str = preg_replace("/\t/", "\\t", $str);
    $str = preg_replace("/\r?\n/", "\\n", $str);
    if(strstr($str, '"')) $str = '"' . str_replace('"', '""', $str) . '"';
}

function isCodeNotUnique($random){
    $codeObject = new Codes();
    $codeObject->UID = $random;
    if(!empty($codeObject->getCodeByUID())){
        return true;
    }
    return false;
}

Now this is taking really long to execute and I believe is not optimal.

  1. How can I optimize so that the unique random numbers are generated quickly?

  2. Will it be faster if the numbers were instead generated in mysql or other way rather than php and if so how do I do that?

  3. When the db starts growing the duplicate check in step b will be really time consuming so how do I avoid that?

  4. Is there a limit on the number of rows in mysql?

Note: The numbers need to be unique across all batches across lifetime of the application.

FBP
  • 345
  • 3
  • 15
  • Are there codes in the DB already? If not you can just generate all codes in PHP and push them to MySQL in a single transaction. May be faster to check for duplicates in PHP. Also 1 million numbers is basically nothing on a modern machine. Both PHP and MySQL can handle it without too much trouble. – apokryfos Nov 01 '16 at 10:02
  • 3) Not if your database is properly indexed; 4) For this relatively small volume of data, no problem, thousands of billions is large, a million isn't – Mark Baker Nov 01 '16 at 10:12
  • @apokryfos not today but once I generate the first batch , this will continue to grow. Can you please elaborate on "If not you can just generate all codes in PHP and push them to MySQL in a single transaction. May be faster to check for duplicates in PHP." – FBP Nov 01 '16 at 11:05
  • 1
    I would do shadow's approach. 1) generate 100 random 3 digit numbers (batches) (range 100 - 999). 2) For each batch generate 10000 unique numbers (range 0 - 999999999) . There will not be a lot of collisions so it will be quick. Combine batch id with the 9 digit batch numbers to give a 12 digit unique number. – Ryan Vincent Nov 01 '16 at 23:05
  • Full implementation here: http://pastebin.com/Wmh9ueZ1. It generates and saves one batch at a time so there should not be memory issues. Amend the constants to get the full values generated an saved. On my system the DB save is very slow (30 sec per batch) but it is a 'dual hamster powered' pc :) – Ryan Vincent Nov 02 '16 at 11:07

5 Answers5

1

1) Divide your range of numbers up to smaller ranges based on the number of batches. E.g. if your range 0 - 1000 and you have 10 batches, then have a batch from 0 - 99, the next 100 - 199, etc. When you generate the numbers for a batch, only generate the random number from the batch range. This way you know that you can only have duplicate numbers within a batch.

Do not insert each number into the database individually, but store them in an array. When you generate a new random number, then check against the array, not the database using in_array() function. When the batch is complete, then use a single insert statement to insert the contents of the batch:

insert into yourtable (bignumber) values (1), (2), ..., (n)

Check MySQL's max_allowed_packet setting to see if it is able to receive the complete sql statement in one go.

Implement a fallback plan, just in case a duplicate value is still found during the insert (error handling and number regeneration).

2) MySQL is not that great on procedural stuff, so I would stick with an external language, such as php.

3) Add a unique index on the field containing the random numbers. If you try to insert a duplicate record, MySQL will prevent it and throws an error. It is really quick.

4) Depending on the actual table engine used (innodb, myisam, etc), its configuration, and the OS, certain limits may apply on the size of the table. See Maximum number of records in a MySQL database table question here on SO for a more detailed answer (check the most upvoted answer, not the accepted one).

Community
  • 1
  • 1
Shadow
  • 33,525
  • 10
  • 51
  • 64
  • I didn't understand your first para in step 1. Can you please provide an example? – FBP Nov 01 '16 at 11:19
  • This one - Divide your range of numbers up to smaller ranges based on the number of batches. – FBP Nov 01 '16 at 12:30
  • I did provide you an example for that in the answer. What is unclear about that? – Shadow Nov 01 '16 at 13:00
  • Ryan's comment helped me understand the concept. However insert into yourtable (bignumber) values (1), (2), ..., (n) is causing a "Allowed memory size of 134217728 bytes exhausted (tried to allocate 79 bytes) " error in PHP. Any idea how to avoid that? – FBP Nov 02 '16 at 08:20
  • Limit your batch sizes (number of unique numbers generated in batch) and free up unneeded variables. – Shadow Nov 02 '16 at 09:18
1

You can do the following:

$random = getExistingCodes(); // Get what you already have (from the DB).  
$random = array_flip($random); //Make them into keys
$existingCount = count($random); //The codes you already have 

do {
    $random[mt_rand(100000000000,999999999999)] = 1;
} while ((count($random)-$existingCount) < $totalLabels);

$random = array_keys($random);

When you generate a duplicate number it will just overwrite that key and not increase the count.

To insert you can start a transaction and do as many inserts as needed. MySQL will try to optimize all operations within a single transaction.

apokryfos
  • 38,771
  • 9
  • 70
  • 114
  • this requires me to use $newCodes = array_diff($random,$existingRandom); to extract the new codes but the same is failing because of memory constraint (Allowed memory size of 134217728 bytes exhausted) – FBP Nov 02 '16 at 10:02
0

Here is a query that generates 1 million pseudo-random numbers without repetitions:

select cast(  (@n := (13*@n + 97) % 899999999981)+1e11 as char(12)) as num
from   (select @n := floor(rand() * 9e11) ) init,
       (select 1 union select 2) m01,
       (select 1 union select 2) m02,
       (select 1 union select 2) m03,
       (select 1 union select 2) m04,
       (select 1 union select 2) m05,
       (select 1 union select 2) m06,
       (select 1 union select 2) m07,
       (select 1 union select 2) m08,
       (select 1 union select 2) m09,
       (select 1 union select 2) m10,
       (select 1 union select 2) m11,
       (select 1 union select 2) m12,
       (select 1 union select 2) m13,
       (select 1 union select 2) m14,
       (select 1 union select 2) m15,
       (select 1 union select 2) m16,
       (select 1 union select 2) m17,
       (select 1 union select 2) m18,
       (select 1 union select 2) m19,
       (select 1 union select 2) m20
limit 1000000;

How it works

It starts by generating a random integer value n with 0 <= n < 900000000000. This number will have the function of the seed for the generated sequence:

@n := floor(rand() * 9e11)

Through multiple (20) joins with inline pairs of records, this single record is multiplied to 220 copies, which is just a bit over 1 million.

Then the selection starts, and as record after record is fetched, the value of the @n variable is modified according to this incremental formula:

@n := (13*@n + 97) % 899999999981

This formula is a linear congruential generator. The three constant numbers need to obey some rules to maximise the period (of non-repetition), but it is the easiest when 899999999981 is prime, which it is. In that case we have a period of 899999999981, meaning that the first 899999999981 generated numbers will be unique (and we need much less). This number is in fact the largest prime below 900000000000.

As a final step, 100000000000 is added to the number to ensure the number always has 12 digits, so excluding numbers that are smaller than 100000000000. Because of the choice of 899999999981 there will be 20 numbers that will never be generated, namely those between 999999999981 and 999999999999 inclusive.

As this generates 220 records, the limit clause will make sure this is chopped off to exactly one million records.

The cast to char(12) is optional, but may be necessary to visualise the 12-digit numbers without them being rendered on the screen in scientific notation. If you will use this to insert records, and the target data type is numeric, then you would leave out this conversion of course.

trincot
  • 317,000
  • 35
  • 244
  • 286
  • If we want 100 random numbers we use a limit of 100 and if we want the random number to be 14 digits we change 899999999981 to the largest prime number below 90000000000000 we can call it lar. The first select becomes select cast( (@n := (13*@n + 97) % lar)+1e14 as char(12)) as num. Beautiful logic – broswilli Jan 07 '20 at 06:31
0
CREATE TABLE x (v BIGINT(12) ZEROFILL NOT NULL PRIMARY KEY);

INSERT IGNORE INTO x (v) VALUES
    (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()),
    (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()),
    (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()),
    (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()),
    (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND()), (FLOOR(1e12*RAND());

Do that INSERT 1e6/15 times.

Check COUNT(*) to see if you have a million. Do this until the table as a million rows:

INSERT IGNORE INTO x (v) VALUES
    (FLOOR(1e12*RAND());

Notes:

  • ZEROFILL is assuming that you want the display to have leading zeros.
  • IGNORE is because there will be some number of duplicates. This avoids the costly check after each insert.
  • "Batch insert" is faster than one row at a time. (Doing 100 at a time is about optimal, but I am lazy.)
  • Potential problem: While I think the pattern of values for RAND() does not repeat at, say 2^16 or 2^32 values, I do not know for a fact. If you can't get to a million, then the random number generator is bad; you should switch to PHP's rand, or something else.
  • Beware of linear consequential random number generators. They are probably easily hacked. (I assume there is some "money" behind the scratch cards.)
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • See https://stackoverflow.com/a/58459869/1766831 -- Since RAND() repeats after 2^30 calls (about 9 digits), you won't get much in the way of "12-digit" randomness. – Rick James Jan 09 '21 at 23:07
0

Do not plan on mt_rand() being unique for small ranges

<?php
// Does mt_rand() repeat?

TryMT(100);
TryMT(100);
TryMT(1000);
TryMT(10000);
TryMT(1e6);
TryMT(1e8);
TryMT(1e10);
TryMT(1e12);
TryMT(1e14);

function TryMT($max) {
    $h = [];
    for ($j = 0; $j<$max; $j++) {
        $v = mt_rand(1, $max);
        if (isset($h[$v])) {
            echo "Dup after $j iterations (limit=$max)<br>\n";

            return;
        }
        $h[$v] = 1;
    }
}

Sample output:

Dup after 7 iterations (limit=100)<br>
Dup after 13 iterations (limit=100)<br>
Dup after 29 iterations (limit=1000)<br>
Dup after 253 iterations (limit=10000)<br>
Dup after 245 iterations (limit=1000000)<br>
Dup after 3407 iterations (limit=100000000)<br>
Dup after 29667 iterations (limit=10000000000)<br>
Dup after 82046 iterations (limit=1000000000000)<br>
Dup after 42603 iterations (limit=1.0E+14)<br>

mt_rand() is a "good" random number generated because it does have dups.

Rick James
  • 135,179
  • 13
  • 127
  • 222