
I'm working on a Symfony 2 based project and need to regularly import 1.5 million products from a .csv document using a cron task.

The import currently looks like this:

    ...

    Propel::disableInstancePooling();

    $fp = fopen($file, 'r');
    while (($line = fgets($fp, 2048)) !== false) {
        $data = str_getcsv($line, "\t");

        $product = new Product();
        $product->setId($data[0]);
        $product->setTitle($data[1]);
        ...
        $product->save();
    }
    fclose($fp);

    ...

However, after about 5 minutes the process hits PHP's 512MB memory limit and throws the following error:

    Fatal error: Allowed memory size of 536870912 bytes exhausted (tried to allocate 78 bytes) in /var/www/vhosts/myapp.local/app/cache/dev/classes.php on line 9451

I've disabled Propel's instance pooling (this has worked in previous Propel-powered imports in other frameworks), but the problem still occurs, so I'm wondering whether Symfony 2 is performing some kind of caching, given where the error is thrown.

Does anyone have experience performing large imports in Symfony 2 who can point me in the right direction?

Joseph Woodward

2 Answers


Oh boy, I was there about 6 months ago. One little word for you: transactions.

Cut the CSV into chunks of roughly 10,000 rows and execute one transaction per chunk. That way you avoid the overhead of committing 1.5 million times to the database and commit only about 150 times instead.
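
A minimal sketch of what that batching could look like, assuming Propel 1.x's getConnection()/beginTransaction() API and the same column layout as in the question (batch size and variable names are just placeholders):

    Propel::disableInstancePooling();
    $con = Propel::getConnection();

    $batchSize = 10000;
    $i = 0;

    $fp = fopen($file, 'r');
    $con->beginTransaction();
    while (($line = fgets($fp, 2048)) !== false) {
        $data = str_getcsv($line, "\t");

        $product = new Product();
        $product->setId($data[0]);
        $product->setTitle($data[1]);
        $product->save($con);

        if (++$i % $batchSize === 0) {
            $con->commit();           // flush this chunk to the database
            $con->beginTransaction(); // start the next chunk
        }
    }
    $con->commit();                   // commit the final partial chunk
    fclose($fp);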

Hope that guides you to a good solution. I couldn't get my own import below 30 seconds, so you may also want to consider an offline task processor for this kind of IO/CPU-bound task.

At the time, this page was my bible: http://propelorm.org/documentation/06-transactions.html

EDIT

Actually, I thought this was some kind of import action required by the site. If it's a one-time thing that you, the developer, need to do, I would go with Xnoise's advice.

MGP
  • Thank you for your reply. I was thinking of writing something in C# to do it offline, but ideally I need it to run every 2 or 3 days and I'd like to avoid the hassle of manually processing the data. – Joseph Woodward Dec 27 '12 at 20:51
  • Python would do it without too much hassle; the csv module is nice. Check this out: http://stackoverflow.com/questions/10154633/load-csv-data-into-mysql-in-python – MGP Dec 28 '12 at 14:33

Actually, this import should be done at the MySQL level directly (MySQL can read data from CSV files), unless it is very complex. The other approach is to read the file incrementally, generate a SQL file, and then import it into the database natively. There is no reason for such a large import to be handled by a web application directly, because you will run out of memory.
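
As a rough sketch of the MySQL-level route, you could issue a LOAD DATA LOCAL INFILE statement from PHP via PDO. The DSN, credentials, and the `product` table with `id` and `title` columns are placeholders here, not taken from the question:

    // Let MySQL parse the tab-separated file itself instead of building objects in PHP.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=myapp;charset=utf8',
        'dbuser',
        'dbpass',
        array(PDO::MYSQL_ATTR_LOCAL_INFILE => true) // required for LOAD DATA LOCAL INFILE
    );

    $sql = "LOAD DATA LOCAL INFILE " . $pdo->quote($file) . "
            INTO TABLE product
            FIELDS TERMINATED BY '\\t'
            LINES TERMINATED BY '\\n'
            (id, title)";

    $pdo->exec($sql);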

Make sure that you don't load the entire CSV into memory at once.

An additional option is to give PHP more memory in php.ini and hope for the best, but PHP is not the best language for batch processing large amounts of data.

Xnoise
  • Alright, I will have a look into importing it at the MySQL level. This size of data import is new territory for me. You rightly mention that PHP is not the best language for batch processing this level of data. Would you suggest I look at Python as an alternative? Thank you. – Joseph Woodward Dec 27 '12 at 20:41
  • It depends. As long as you preprocess the data (if you don't receive it in a native format like SQL statements for MySQL) and you are careful about resource consumption (so you don't run out of memory), you can convert the data into a format that MySQL understands. One approach would be to read the first x lines from the file, dump the resulting SQL into a file, and continue like this until you have processed the entire file (see the sketch after these comments). Once you have the SQL, let the mysql client do its job; it will do it faster than your script. But beware: test your scripts. I've seen too much data loss caused by a poor script. – Xnoise Dec 27 '12 at 21:18
  • Honestly, I am not sure whether Python, C, or any other language is better. You just need to pay attention to the resources and use them wisely; the mysql client will do the rest most of the time. – Xnoise Dec 27 '12 at 21:19
  • I used MySQL's LOAD DATA INFILE function, which seems to have done the job nicely! I cut a 10+ minute import down to a 10-second one! – Joseph Woodward Dec 28 '12 at 20:25
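
For completeness, a sketch of the chunked "generate a SQL file, then let the mysql client import it" approach described in the comments above. The `product` table, its `id`/`title` columns, the output path, and the `addslashes()` escaping are simplifications for illustration, not production-ready choices:

    $in  = fopen($file, 'r');
    $out = fopen('/tmp/products.sql', 'w');

    $values = array();
    while (($line = fgets($in, 2048)) !== false) {
        $data = str_getcsv($line, "\t");
        $values[] = sprintf("(%d, '%s')", (int) $data[0], addslashes($data[1]));

        // Write a multi-row INSERT every 1000 rows so memory usage stays flat.
        if (count($values) === 1000) {
            fwrite($out, "INSERT INTO product (id, title) VALUES " . implode(',', $values) . ";\n");
            $values = array();
        }
    }
    if ($values) {
        fwrite($out, "INSERT INTO product (id, title) VALUES " . implode(',', $values) . ";\n");
    }
    fclose($in);
    fclose($out);

    // Then import natively:  mysql -u dbuser -p myapp < /tmp/products.sql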