
I have several thousand records (stored in a MySQL table) that I need to "batch process." Each record contains a large JSON blob; in some cases the JSON is over 1 MB (yes, my DB is well over 1 GB).

I have a function that grabs a record, decodes the JSON, changes some data, re-encodes the PHP array back to JSON, and saves it back to the DB. Pretty simple. FWIW, this is within the context of a CakePHP app.

Given an array of IDs, I'm attempting to do something like this (very simplified mock code):

foreach ($ids as $id) {
    $this->Model->id = $id;
    $data = $this->Model->read();
    $newData = processData($data);
    $this->Model->save($newData);
}

The issue is that, very quickly, PHP runs out of memory. When running a foreach like this, it's almost as if PHP moves from one record to the next, without releasing the memory required for the preceding operations.

Is there any way to run a loop such that memory is freed before moving on to the next iteration, so that I can actually process this massive amount of data?
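To illustrate what I mean by "freeing memory," here's a minimal, self-contained snippet (nothing CakePHP-specific; the array just stands in for one decoded record, and `gc_collect_cycles()` requires PHP 5.3+):

```php
<?php
// Stand-in for one decoded ~1MB JSON record.
$record = array_fill(0, 100000, str_repeat('x', 64));
$before = memory_get_usage();

// What I'd like to happen at the end of each iteration:
unset($record);      // drop the only reference to the big array
gc_collect_cycles(); // PHP 5.3+: also collect any cyclic garbage

$after = memory_get_usage();
// $after is now well below $before, so the next record starts "fresh"
```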

Edit: Adding more code. This function takes my JSON, converts it to a PHP array, does some manipulation (namely, reconfiguring data based on what's present in another array), and replaces values in the original array. The JSON is many layers deep, hence the extremely long foreach loops.

function processData($theData) {
    $toConvert = json_decode($theData['Program']['data'], true);
    foreach ($toConvert['cycles'] as $cycle => $val) {
        foreach ($toConvert['cycles'][$cycle]['days'] as $day => $val) {
            foreach ($toConvert['cycles'][$cycle]['days'][$day]['sections'] as $section => $val) {
                foreach ($toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'] as $exercise => $val) {
                    if (isset($toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFolder'])) {
                        $folderName = $toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFolder']['folderName'];
                        if (isset($newFolderList['Folders'][$folderName])) {
                            $toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFolder'] = $newFolderList['Folders'][$folderName]['id'];
                        }
                    }
                    if (isset($toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFile'])) {
                        $fileName = basename($toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFile']['fileURL']);
                        if (isset($newFolderList['Exercises'][$fileName])) {
                            $toConvert['cycles'][$cycle]['days'][$day]['sections'][$section]['exercises'][$exercise]['selectedFile'] = $newFolderList['Exercises'][$fileName]['id'];
                        }
                    }
                }
            }
        }
    }
    return $toConvert;
}
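For what it's worth, the same walk can be written with foreach by reference, which avoids repeating the long index chains. This is just a sketch: I've passed $newFolderList in as a parameter and trimmed things down, but the array shape matches the JSON described above.

```php
<?php
// Sketch of processData() using foreach by reference instead of
// repeated index chains. $newFolderList is the lookup array
// described above, passed in explicitly here.
function processDataByRef(array $theData, array $newFolderList) {
    $toConvert = json_decode($theData['Program']['data'], true);
    foreach ($toConvert['cycles'] as &$cycle) {
        foreach ($cycle['days'] as &$day) {
            foreach ($day['sections'] as &$section) {
                foreach ($section['exercises'] as &$exercise) {
                    if (isset($exercise['selectedFolder'])) {
                        $folderName = $exercise['selectedFolder']['folderName'];
                        if (isset($newFolderList['Folders'][$folderName])) {
                            $exercise['selectedFolder'] = $newFolderList['Folders'][$folderName]['id'];
                        }
                    }
                    if (isset($exercise['selectedFile'])) {
                        $fileName = basename($exercise['selectedFile']['fileURL']);
                        if (isset($newFolderList['Exercises'][$fileName])) {
                            $exercise['selectedFile'] = $newFolderList['Exercises'][$fileName]['id'];
                        }
                    }
                }
            }
        }
    }
    unset($cycle, $day, $section, $exercise); // clear dangling references
    return $toConvert;
}
```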

Model->read() essentially just tells Cake to pull a record from the DB and return it as an array. There's plenty happening behind the scenes; someone more knowledgeable would have to explain that.

Benjamin Allison
  • You could sleep at the end of each loop, and if you're running PHP 5.3 or greater you can (try to) call the garbage collector. – Nick Pickering Apr 19 '13 at 01:32
  • How would sleep help? – John Carter Apr 19 '13 at 01:32
  • @therefromhere give the gc more time to kick in, or finish what it's doing. – Nick Pickering Apr 19 '13 at 01:33
  • @NicholasPickering: What about simply iterating through each record of the source table individually, rather than relying on a `foreach` collection? – Robert Harvey Apr 19 '13 at 01:34
  • @NicholasPickering I'm not familiar with Cake, but I doubt that will help. If you've got a big object waiting for GC, unsetting it might, though. See a similar situation in Symfony/Doctrine: http://stackoverflow.com/questions/2097744/php-symfony-doctrine-memory-leak?lq=1 – John Carter Apr 19 '13 at 01:36
  • @RobertHarvey Is he able to save all of the objects back to the database without a loop? – Nick Pickering Apr 19 '13 at 01:37
  • @therefromhere He's parsing JSON that's 1MB large. That's a good chunk of memory sitting there doing nothing once it's back in the DB. – Nick Pickering Apr 19 '13 at 01:39
  • Even if you fix this you will probably find that PHP combined with ORM leaks *some* memory in any case, so I'd suggest writing your script in such a way that it fails gracefully (eg using DB transactions) and re-running it will carry on from where it died. – John Carter Apr 19 '13 at 01:40
  • I'm not sure how I could really do it without a loop. It makes the most sense to process one record at a time, and the only way I can foresee doing that (apart from spending weeks doing it manually :) is to use a loop. Each record is read, processed, and saved with in the loop; I'm not trying to read or save more than one record at a time. – Benjamin Allison Apr 19 '13 at 01:41
  • @therefromhere Good idea. I'd thought some kind of "pick up from where we left off" fallback might be necessary. The idea of hitting refresh a few dozen times isn't all that bad! – Benjamin Allison Apr 19 '13 at 01:42
  • Break away from the code you've got for a second and think of the simplest way to do this. Pull all of the records in one query. Loop through all of the records, generating a single new update query. Then run your new update query. That should cut down drastically on memory usage. – Nick Pickering Apr 19 '13 at 01:43
  • 1
    @NicholasPickering I don't see how trying to work with over 1GB worth of data at once helps me with the memory issue. Maybe I'm missing something? – Benjamin Allison Apr 19 '13 at 01:47
  • 1
    @BenjaminAllison Good point. I'm stupid, Carry on. – Nick Pickering Apr 19 '13 at 01:56
  • What does `processData()` look like? (Some information about `Model` might be helpful too. Specifically `Model->read()`) – svidgen Apr 19 '13 at 01:56
  • At first glance, your `processData()` seems fine. I'd be interested to see how eliminating everything except `$this->Model->id = $id; $this->Model->read();` from the loop affects memory over time. – svidgen Apr 19 '13 at 02:21

1 Answer


The first thing I would do is make sure everything is passed by reference.

E.g.:

foreach ($ids as $id) {
    processData($data);
}

function processData(&$d) {}

http://php.net/manual/en/language.references.pass.php
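To make the idea concrete, here's a tiny self-contained example (the helper function and data are made up) showing that a by-reference parameter lets the function modify the caller's array in place rather than returning a modified copy:

```php
<?php
// Hypothetical helper: trims every value in the caller's array in place.
function trimValues(array &$d) {
    foreach ($d as &$v) {
        $v = trim($v);
    }
    unset($v); // break the reference left behind by the loop
}

$row = ['name' => '  Benjamin  ', 'city' => ' Toronto '];
trimValues($row);
// $row is now ['name' => 'Benjamin', 'city' => 'Toronto']
```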

Jase Whatson