  1. I have a CSV file containing millions of email addresses which I want to upload quickly into a MySQL database with PHP.
  2. Right now I'm using a single-threaded program, which takes too much time to upload them.

//get the csv file
$file = $_FILES['csv']['tmp_name'];
$handle = fopen($file,"r");

//loop through the csv file and insert into database
while ($data = fgetcsv($handle, 1000, ",", "'")) {
    if ($data[0]) {
        $expression = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";
        if (preg_match($expression, $data[0])) {
            $query = mysql_query("SELECT * FROM `postfix`.`recipient_access` where recipient='".$data[0]."'");
            mysql_query("SET NAMES utf8");
            $fetch = mysql_fetch_array($query);
            if ($fetch['recipient'] != $data[0]) {
                $query = mysql_query("INSERT INTO `postfix`.`recipient_access`(`recipient`, `note`) VALUES('".addslashes($data[0])."','".$_POST['note']."')");
            }
        }
    }
}
  • Do you need to do this frequently, like every day? If not, I'd just leave things as they are and wait. For any threaded solution you'd almost definitely have to go outside PHP. Make sure your tables are properly indexed though, so your select runs at optimal speed. You'll `SET NAMES` only once, take that out of the loop. – Pekka Nov 30 '13 at 20:13
  • Why are you processing it in PHP instead of MySQL? – Ignacio Vazquez-Abrams Nov 30 '13 at 20:13
  • It's working, but I need a multithreaded program to upload the file ten times or more faster using core PHP. – Umesh Vishwakarma Nov 30 '13 at 20:14
  • Well, I need a million dollars. But I'm not likely to get them anytime soon. I'm fairly sure the same applies to this requirement :) – Pekka Nov 30 '13 at 20:14
  • I have thousands of files and the number is increasing day by day; it's not good to go through MySQL. – Umesh Vishwakarma Nov 30 '13 at 20:18
  • I have another idea: break my CSV file into many parts while uploading it. After that, the parts can be processed by many loops. But how can I divide or break a file into many parts with PHP? – Umesh Vishwakarma Nov 30 '13 at 20:27
  • Thousands of rows in your one file, or many thousands of files being uploaded daily? Maybe try reducing the data in these file(s)? If these files contain image data, maybe use image URLs instead? – KarlosFontana Nov 30 '13 at 20:41
  • Umesh, this process is highly optimizable, but that involves structural modifications. – Sebas Nov 30 '13 at 20:52

4 Answers


First of all, I can't stress this enough: fix your indentation - it will make life easier for everyone.

Secondly, the answer depends a lot on the actual bottlenecks you are encountering:

  • Regular expressions are very slow, especially when they're in a loop.
  • Databases tend to work well for either WRITES or READS but not BOTH: try decreasing the number of queries you run per row (see the sketch after this list).
  • It stands to reason that the less PHP code runs inside your loop, the faster it will work. Consider reducing the number of conditions, for instance.
  • For the record, your code is not safe against SQL injection: escape or filter $_POST beforehand [*]
  • [*] Speaking of which, it's faster to access a plain variable than an array index like $_POST, so copy the value into a variable before the loop.
  • You can simulate multithreading by having your main program split the huge CSV file into smaller chunks and run each chunk in a different process.
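To make the first two points concrete, here is a minimal sketch of a batched approach (an illustration, not the original code of this answer). It assumes a UNIQUE key on `recipient`, an already-open mysql_* connection, and the same $handle and $expression as in the question; with that key, INSERT IGNORE skips duplicates server-side, so the per-row SELECT disappears:

// Hypothetical sketch: collect rows and insert them in batches.
// Assumes a UNIQUE KEY on `recipient` and an open mysql_* connection.
$note  = mysql_real_escape_string($_POST['note']);
$batch = array();
while (($data = fgetcsv($handle, 1000, ",", "'")) !== false) {
    if (empty($data[0]) || !preg_match($expression, $data[0])) continue;
    $batch[] = "('" . mysql_real_escape_string($data[0]) . "','" . $note . "')";
    if (count($batch) >= 1000) {    // flush every 1000 rows
        mysql_query("INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`,`note`) VALUES " . implode(',', $batch));
        $batch = array();
    }
}
if ($batch) {                       // flush the remainder
    mysql_query("INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`,`note`) VALUES " . implode(',', $batch));
}

One statement per thousand rows means one client/server round trip per batch instead of two queries per address.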

common.php

class FileLineFinder {
    protected $handle, $length, $curpos;
    public function __construct($file){
        $this->handle = fopen($file, 'r');
        $this->length = strlen(PHP_EOL);
        $this->curpos = 0;
    }
    // returns the byte offset just after the next line break, or false at EOF
    public function next_line(){
        while(!feof($this->handle)){
            $b = fread($this->handle, $this->length);
            $this->curpos += $this->length;
            if ($b == PHP_EOL) return $this->curpos;
        }
        return false;
    }
    // skips $count lines and returns the offset of the line after them (false at EOF)
    public function skip_lines($count){
        $pos = false;
        for($i = 0; $i < $count; $i++)
            $pos = $this->next_line();
        return $pos;
    }
    public function __destruct(){
        fclose($this->handle);
    }
}

// run $cmd in the background, redirect its output to $outfile
// and append its process id to $pidfile (used later to poll for completion)
function exec_async($cmd, $outfile, $pidfile){
    exec(sprintf('%s > %s 2>&1 & echo $! >> %s', $cmd, $outfile, $pidfile));
}

main.php

require('common.php');

$maxlines = 200;      // maximum number of lines each subtask will process
$note = $_POST['note'];
$file = $_FILES['csv']['tmp_name'];
$outdir = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'out' . DIRECTORY_SEPARATOR;

//make sure our output directory exists
if(!is_dir($outdir))
    if(!mkdir($outdir, 0755, true))
        die('Cannot create output directory: '.$outdir);

// run a task for each chunk of lines in the csv file
$i = 0; $pos = 0;
$l = new FileLineFinder($file);
do {
    $i++;
    exec_async(
        'php -f sub.php -- '.$pos.' '.$maxlines.' '.escapeshellarg($file).' '.escapeshellarg($note),
        $outdir.'proc'.$i.'.log',
        $outdir.'proc'.$i.'.pid'
    );
} while ($pos = $l->skip_lines($maxlines));

// wait for each task to finish: a pid file is removed once its process has exited
do {
    $tasks = 0;
    foreach (glob($outdir.'proc*.pid') as $pidfile) {
        $pid = (int)trim(file_get_contents($pidfile));
        if ($pid > 0 && posix_kill($pid, 0)) $tasks++;  // still running (needs the posix extension)
        else unlink($pidfile);                          // done, forget it
    }
    echo 'Remaining Tasks: '.$tasks.PHP_EOL;
    sleep(1);
} while ($tasks > 0);
echo 'Finished!'.PHP_EOL;

sub.php

require('common.php');

// NOTE: sub.php runs as its own process, so it needs its own database
// connection (e.g. via an included config/mysql_connect() call) before
// mysql_real_escape_string() or any query below will work.

$start = (int)$argv[1];   // byte offset to start reading from
$count = (int)$argv[2];   // number of lines to process
$file  = $argv[3];
$note  = mysql_real_escape_string($argv[4]);
$lines = 0;

$handle = fopen($file, 'r');
fseek($handle, $start, SEEK_SET);

$expression = "/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/";

mysql_query('SET NAMES utf8');

//loop through this chunk of the csv file and insert into the database
while (($lines < $count) && ($data = fgetcsv($handle, 1000, ',', "'"))) {
    $lines++;
    if ($data[0] && preg_match($expression, $data[0])) {
        $recipient = mysql_real_escape_string($data[0]);
        $query = mysql_query('SELECT 1 FROM `postfix`.`recipient_access` WHERE recipient="'.$recipient.'"');
        if (mysql_num_rows($query) == 0) {
            mysql_query('INSERT INTO `postfix`.`recipient_access` (`recipient`, `note`) VALUES ("'.$recipient.'","'.$note.'")');
        }
    }
}

– Christian

The most pressing thing to do is to make sure your database is properly indexed so the lookup query you do for every row is as fast as possible.
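For example, a one-off statement along these lines (the index name here is made up; a UNIQUE key would also let you replace the per-row SELECT with INSERT IGNORE):

// One-off sketch: add an index (here a unique key, name hypothetical) on the
// column the per-row SELECT filters by, so the lookup is no longer a table scan.
mysql_query("ALTER TABLE `postfix`.`recipient_access` ADD UNIQUE KEY `idx_recipient` (`recipient`)");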

Other than that, there simply isn't that much you can do. For a multithreaded solution, you'll have to go outside PHP.

You could also just import the CSV file into MySQL, and then weed out the superfluous data using your PHP script - that is likely to be the fastest way.
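A rough sketch of that idea (an illustration only; it assumes LOCAL INFILE is enabled, a hypothetical staging table, and that $file and $note have already been prepared as in the question's script):

// Sketch: bulk-load the raw CSV into a staging table, then copy rows over
// in a single statement; stricter validation can be weeded out afterwards.
// Assumes LOCAL INFILE is enabled and $note has already been escaped.
mysql_query("CREATE TABLE IF NOT EXISTS `postfix`.`recipient_staging` (`recipient` VARCHAR(255))");
mysql_query("LOAD DATA LOCAL INFILE '" . mysql_real_escape_string($file) . "'
             INTO TABLE `postfix`.`recipient_staging`
             FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY \"'\"
             (`recipient`)");
mysql_query("INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`, `note`)
             SELECT `recipient`, '" . $note . "'
             FROM `postfix`.`recipient_staging`
             WHERE `recipient` LIKE '%_@_%._%'");

The LIKE pattern is only a crude filter; the stricter regex check from the question can still run over the staging table afterwards, which is the "weed out with your PHP script" step.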

– Pekka

Just a general suggestion: the key to speeding up any program is to know which part takes most of the time.

Then figure out how to reduce it. Sometimes you will be very surprised by the actual result.
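As a rough illustration of how to find that out (a sketch, assuming the $handle and $expression from the question), you can time the regex check and the database work separately with microtime():

// Rough profiling sketch: accumulate the time spent in the regex check and
// in the queries separately, so you can see which one actually dominates.
$t_regex = $t_sql = 0.0;
while (($data = fgetcsv($handle, 1000, ",", "'")) !== false) {
    $t0 = microtime(true);
    $valid = !empty($data[0]) && preg_match($expression, $data[0]);
    $t_regex += microtime(true) - $t0;

    if ($valid) {
        $t0 = microtime(true);
        // ... the SELECT/INSERT from the question goes here ...
        $t_sql += microtime(true) - $t0;
    }
}
printf("regex: %.2fs, sql: %.2fs\n", $t_regex, $t_sql);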

By the way, I don't think multithreading would solve your problem.

– tly_alex

Put the whole loop inside an SQL transaction. That will speed things up by an order of magnitude.
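A minimal sketch of that, assuming an InnoDB table and the same mysql_* style as the question (the IGNORE relies on a UNIQUE key on recipient; a plain INSERT plus the original duplicate check works too):

// Wrap the whole insert loop in one transaction (InnoDB) so MySQL flushes
// to disk once at COMMIT instead of once per INSERT statement.
mysql_query("START TRANSACTION");
while (($data = fgetcsv($handle, 1000, ",", "'")) !== false) {
    if (empty($data[0]) || !preg_match($expression, $data[0])) continue;
    mysql_query("INSERT IGNORE INTO `postfix`.`recipient_access` (`recipient`, `note`)
                 VALUES ('" . mysql_real_escape_string($data[0]) . "', '" . $note . "')");
}
mysql_query("COMMIT");

With autocommit on (the default), every single INSERT is its own transaction and its own disk flush; grouping the rows into one commit removes that per-row overhead.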

– Philip Sheard