
I need to test out some queries on some large datasets (1M - 50M rows, for example), and I'm having trouble writing a process that will generate and insert the data quickly.

I have the code below, which is slightly modified from a version I found in another SO post:

set_time_limit(0);

$db_host = 'localhost';
$db_name = 'test';

$db_user = '';
$db_pass = '';

$entries = 1000000;
$entry_words_min = 250;
$entry_words_max = 1000;

/*
  End configuration
*/

function get_rand_word( $len_min, $len_max ) {
    $word = '';
    // Pick the length once; calling rand() inside the loop condition
    // would re-roll it on every iteration.
    $len = rand( $len_min, $len_max );
    for ( $i = 0; $i < $len; $i++ ) {
        $word .= chr( rand( 65, 90 ) ); // random uppercase letter A-Z
    }
    return $word;
}
function get_title() {
    $title = '';
    $words = rand( 4, 10 );
    for ( $i = 0; $i < $words; $i++ ) {
        $title .= get_rand_word( 2, 9 ) . ' ';
    }
    return $title;
}
function get_fulltext() {
    $fulltext = '';
    $words = rand( 250, 500 );
    for ( $i = 0; $i < $words; $i++ ) {
        $fulltext .= get_rand_word( 2, 9 ) . ' ';
    }
    return $fulltext;
}

$dsn = 'mysql:dbname=' . $db_name . ';host=' . $db_host;

try {
    $dbh = new PDO($dsn, $db_user, $db_pass);
} catch (PDOException $e) {
    echo 'Connection failed: ' . $e->getMessage();
    die();
}

$sth = $dbh->prepare('INSERT INTO `sphinx` (`some_id`,`title`,`content`) VALUES (:some_id, :title, :content)');

$counter = 0;

for ( $i = 0; $i < $entries; $i++ ) {
    $sth->execute(array(
        ':some_id' => mt_rand(1,65000),
        ':title' => get_title(),
        ':content' => get_fulltext()
    ));
    $counter++;
}

echo $counter . ' rows inserted';

However, this is rather slow at inserting; it has taken several hours just to insert 500k rows. I know I could write a batch insert query, but I don't think it's such a good idea to write a single one for a million inserts; I guess it could be broken down into 10k rows at a time or something, but I wanted to check whether there are any better/cleaner methods available.
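For reference, a chunked multi-row insert can be built by concatenating placeholder groups and wrapping the whole run in a single transaction, so InnoDB does not commit after every row. This is only a sketch reusing the table and helper functions above; the batch size of 500 is an arbitrary illustration value, and a very large batch can exceed MySQL's max_allowed_packet.

$batchSize = 500;             // illustrative; tune against max_allowed_packet
$dbh->beginTransaction();     // one commit at the end instead of one per row

for ($offset = 0; $offset < $entries; $offset += $batchSize) {
    $placeholders = array();
    $values = array();

    for ($i = 0; $i < $batchSize; $i++) {
        $placeholders[] = '(?, ?, ?)';
        $values[] = mt_rand(1, 65000);
        $values[] = get_title();
        $values[] = get_fulltext();
    }

    // One statement per chunk instead of one per row.
    $sql = 'INSERT INTO `sphinx` (`some_id`, `title`, `content`) VALUES '
         . implode(', ', $placeholders);
    $dbh->prepare($sql)->execute($values);
}

$dbh->commit();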

Brett
  • Create a CSV file and use `LOAD DATA INFILE` – HoboSapiens Apr 26 '15 at 11:27
  • @HoboSapiens I guess the issue with that is filling the CSV file with the correct data. Guess I could do it with PHP, but Excel would probably be easier. – Brett Apr 26 '15 at 11:30
  • Writing a CSV file in PHP will be an order of magnitude faster than row-by-row insertion in a database. LOAD DATA INFILE will offer a similar improvement, especially if you drop any indexes first and recreate them afterwards. It's two operations but it should be a lot faster than your current approach. –  Apr 26 '15 at 12:08
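A rough sketch of the CSV + LOAD DATA INFILE approach suggested in the comments above, reusing the helper functions from the question. The file path is illustrative, and LOAD DATA LOCAL INFILE has to be permitted by the server and enabled on the PDO connection for the final statement to run; dropping and recreating indexes around the load is a further optimisation not shown here.

$csvPath = '/tmp/sphinx_seed.csv'; // illustrative path

// Write the rows to disk first; this is pure PHP and very fast.
$fh = fopen($csvPath, 'w');
for ($i = 0; $i < $entries; $i++) {
    fputcsv($fh, array(mt_rand(1, 65000), get_title(), get_fulltext()));
}
fclose($fh);

// Bulk-load the file in a single statement. The connection must be
// created with PDO::MYSQL_ATTR_LOCAL_INFILE => true in its options
// array, and the server must allow local_infile.
$dbh->exec(
    "LOAD DATA LOCAL INFILE " . $dbh->quote($csvPath) . "
     INTO TABLE `sphinx`
     FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
     LINES TERMINATED BY '\\n'
     (`some_id`, `title`, `content`)"
);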

2 Answers


You could use https://github.com/fzaninotto/Faker. You can make a loop and fake as many records as you need, all with random data.
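A minimal sketch of how Faker could slot into the existing loop, assuming the library is installed via Composer; the formatter methods used here (numberBetween, sentence, paragraphs) come from Faker's standard providers. Note this addresses the realism of the data rather than the insert speed.

require 'vendor/autoload.php';

$faker = Faker\Factory::create();

$sth = $dbh->prepare('INSERT INTO `sphinx` (`some_id`,`title`,`content`) VALUES (:some_id, :title, :content)');

for ($i = 0; $i < $entries; $i++) {
    $sth->execute(array(
        ':some_id' => $faker->numberBetween(1, 65000),
        ':title'   => $faker->sentence(6),          // short random title
        ':content' => $faker->paragraphs(5, true),  // several paragraphs as one string
    ));
}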

Ovidiu Badita
  • Thank you for the link, it seems to be a great resource to fill in *correct* data, but is it faster than what I am doing now? – Brett Apr 26 '15 at 13:36
  • @Brett Actually, you want an insane amount of records inserted... this may not be faster than what you are doing. I am thinking that if you do not care about data duplication, but you do care about the amount of data, you could make an insert of 5-10k records, then `INSERT INTO `sphinx` SELECT * FROM `sphinx`` and run that several times. You may want to choose which values you want to insert and exclude the IDs. – Ovidiu Badita Apr 26 '15 at 13:59
  • I suppose that is a possible method; though tbh I'm probably leaning towards going with the csv method as I think that will be by far the fastest. – Brett Apr 26 '15 at 14:26
  • `LOAD DATA INFILE` is 20 times faster than using any kind of insert statement. But your file would be huge and I do not know how long it would take to make it and save it. Then again, you can make a smaller file and load it multiple times. – Ovidiu Badita Apr 26 '15 at 15:18
  • Haha I guess we will see :D – Brett Apr 26 '15 at 15:40
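A rough sketch of the duplication trick described in the comments above: seed the table with a small batch, then repeatedly re-insert its own rows, excluding the auto-increment primary key (assumed here to be a column named `id`) so the copies get fresh keys. Each pass doubles the row count, so about seven passes turn 10k seed rows into over 1M.

// Assumes the table already holds a small seed batch (5-10k rows).
for ($pass = 0; $pass < 7; $pass++) {   // 10k * 2^7 ≈ 1.28M rows
    $dbh->exec(
        'INSERT INTO `sphinx` (`some_id`, `title`, `content`)
         SELECT `some_id`, `title`, `content` FROM `sphinx`'
    );
}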

First of all, I suggest measuring the elapsed time every nth insert, say every 10,000 iterations, to see whether anything anomalous is happening:

$start = microtime(true);
for ($i = 0; $i < $entries; $i++) {
    $sth->execute(array(
        ':some_id' => mt_rand(1,65000),
        ':title' => get_title(),
        ':content' => get_fulltext()
    ));
    // $start is never reset, so each report shows the cumulative elapsed
    // time since the loop began, printed once every 10,000 inserts.
    if ($i % 10000 === 9999) {
        $end = microtime(true);
        echo $i . ' ' . ($end - $start) . '<br>';
    }
    $counter++;
}

The problem might also be the table's storage engine, its indexes, or even the hard disk drive. The best-known engine comparison is MyISAM versus InnoDB.

jskidie