
I've been looking into using iterators for batch processing in Doctrine (http://docs.doctrine-project.org/en/2.0.x/reference/batch-processing.html). I've got a database with 20,000 images which I would like to iterate over.

I understand that using an iterator is supposed to prevent Doctrine from loading every row into memory. However, the memory usage of the two examples is almost exactly the same. I am measuring memory before and after each run using (memory_get_usage() / 1024).

$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
$iterable = $query->iterate();

while (($image = $iterable->next()) !== false) {
    // Do something here!
}

Memory usage for the iterator.

Memory usage before: 2823.36328125 KB
Memory usage after: 50965.3125 KB

This second example loads the entire result set into memory using the findAll method.

$images = $this->em->getRepository('Acme\Entities\Image')->findAll();

Memory usage for findAll.

Memory usage before: 2822.828125 KB
Memory usage after: 51329.03125 KB
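For reference, the numbers above are taken roughly like this (a sketch; runExample() is just a placeholder for either of the two snippets):

$before = memory_get_usage() / 1024;

runExample(); // placeholder for the iterate() loop or the findAll() call

$after = memory_get_usage() / 1024;
echo "Memory usage before: {$before} KB\n";
echo "Memory usage after: {$after} KB\n";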

6 Answers


Batch processing with Doctrine is trickier than it seems, even with the help of iterate() and IterableResult.

Just as you expected, the greatest benefit of IterableResult is that it does not load all of the elements into memory at once. The second benefit is that it doesn't hold references to the entities you load, so IterableResult itself doesn't prevent the GC from freeing the memory of your entities.

However, there is another object, Doctrine's EntityManager (more specifically its UnitOfWork), which holds a reference to every object you have queried explicitly or implicitly (via EAGER associations).

In simple words: whenever you get entities back from findAll(), findOneBy(), DQL queries, or even IterableResult, a reference to each of those entities is saved inside Doctrine. The reference is simply stored in an associative array, the identity map; in pseudocode: $identityMap['Acme\Entities\Image'][0] = $image0;

So, upon each iteration of your loop, the previous images (despite no longer being reachable from the loop's scope or IterableResult's scope) are still present inside this identityMap. GC cannot clean them, and your memory consumption ends up the same as when you were calling findAll().
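You can even watch this happening: the UnitOfWork exposes the map through its public getIdentityMap() method. A minimal sketch (assuming $query is the same DQL query as above):

$uow = $this->em->getUnitOfWork();

foreach ($query->iterate() as $image_row) {
    // the count of managed Image entities grows by one on every iteration
    $identityMap = $uow->getIdentityMap();
    echo count($identityMap['Acme\Entities\Image'] ?? []), "\n";
}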

Now let's go through the code and see what is actually happening

$query = $this->em->createQuery('SELECT i FROM Acme\Entities\Image i');
// here Doctrine only creates the Query object; no DB access happens yet

$iterable = $query->iterate();
// unlike findAll(), no DB access happens upon this call either;
// the Query object is simply wrapped in an iterator

while (($image_row = $iterable->next()) !== false) {
    // upon the first call to next() the DB WILL BE ACCESSED FOR THE FIRST TIME:
    // the first resulting row is fetched and hydrated into an Image object
    // ----> A REFERENCE TO THE OBJECT IS SAVED INSIDE $identityMap <----
    // next() then hands the row back to you

    // to access the actual Image object, take the [0]th element of the array
    $image = $image_row[0];

    // Do something here!
    write_image_data_to_file($image, 'myimage.data.bin');

    // as this iteration ends, $image (and $image_row) go out of scope
    // and, from what we have seen so far, should be ready for GC;
    // however, because a reference to this specific Image object is still held
    // by the EntityManager (inside $identityMap), GC will NOT clean it
}
// so by the end of your loop you will consume as much memory
// as you would have by using findAll()

So the first solution is to explicitly tell the Doctrine EntityManager to detach the object from the $identityMap. I also replaced the while loop with a foreach to make it more readable.

foreach ($iterable as $image_row) {
    $image = $image_row[0];

    // do something with the image
    write_image_data_to_file($image, 'myimage.data.bin');

    $entity_manager->detach($image);
    // this line tells Doctrine to remove the _reference_to_the_object_
    // from the identity map; the object is then ready for GC
}

However, the example above has a few flaws, even though it is featured in Doctrine's documentation on batch processing. It works well as long as your Image entity doesn't EAGERly load any of its associations. But if you are EAGERly loading any association, e.g.:

/**
 * @ORM\Entity
 */
class Image {

  /**
   * @ORM\Column(type="integer")
   * @ORM\Id
   */
  private $id;

  /**
   * @ORM\Column(type="string")
   */
  private $imageName;

  /**
   * @ORM\ManyToOne(targetEntity="Acme\Entities\User", fetch="EAGER")
   *
   * This association is automatically (EAGERly) loaded by Doctrine
   * every time you query the Image entity, whether by findXXX(), DQL or iterate()
   */
  private $owner;

  // getters/setters left out for clarity
}

So if we use the same piece of code as above:

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity
    // in memory; it can be accessed via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image, 'myimage.data.bin');

    $entity_manager->detach($image);
    // here we detach the Image entity, but the $owner User entity is still
    // referenced in Doctrine's $identityMap. Thus we are still leaking memory.
}

A possible solution is to use EntityManager::clear() instead of EntityManager::detach(); clear() empties the identity map COMPLETELY.

foreach ($iterable as $image_row) {
    $image = $image_row[0];
    // here, because of EAGER loading, we already have the owner entity
    // in memory; it can be accessed via $image->getOwner()

    // do something with the image
    write_image_data_to_file($image, 'myimage.data.bin');

    $entity_manager->clear();
    // now $identityMap is cleared of ALL entities it holds:
    // the Image and the User loaded in this loop iteration and, as a
    // SIDE EFFECT, all OTHER entities which may have been loaded by you
    // earlier. Thus when you run this loop you must NOT rely
    // on any entities you have persist()ed or remove()d;
    // all changes since the last flush() will be lost.
}
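As an aside, older Doctrine 2.x releases also let you pass an entity class name to clear(), detaching only entities of that type (partial clearing was later deprecated, see the issue linked in the comments). A sketch, assuming such a version:

$entity_manager->clear(Image::class);
// detaches ALL managed Image entities, but leaves other entity types
// (and your pending changes to them) alone; in the EAGER example above
// this would still leave the owner User entities in the identity map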
Dimitry K
  • Thank you for the detailed explanation. I will add from my experience that `$iterable = $query->iterate();` can consume tons of memory as well, because of query buffering. http://php.net/manual/en/mysqlinfo.concepts.buffering.php – gatisl Apr 06 '15 at 08:57
  • @dimitri_k Thank you for the details. Btw, is there a way to use HYDRATE_SCALAR with the iterator? If yes, would that be another alternative to deal with this `identityMap` issue? – Stphane Jun 15 '16 at 15:31
  • @Stphane that's an interesting suggestion to use HYDRATE_SCALAR. Originally I didn't think of it as viable option because when you use HYDRATE_SCALAR - you wouldn't get objects returned in a nice traversable object graph. You will NOT be able to follow relationship from one object to another like in `$image->getUser()->getProfession()->getSalary()`. They would be returned as big union array (akin to `SELECT ... JOIN`). HYDRATE_SCALAR may work for you when you have flat object relationships (1 or 2 levels). I am not sure how it works when you have more relationships added to object later... – Dimitry K Jun 15 '16 at 15:55
  • But I looked at the Doctrine source and `ArrayHydrator` doesn't seem to be calling `$this->registerManaged(..)`. Which probably means that you shouldn't be getting memory leaks. https://github.com/doctrine/doctrine2/blob/master/lib/Doctrine/ORM/Internal/Hydration/ArrayHydrator.php – Dimitry K Jun 15 '16 at 15:56
  • `Unlike findAll(), upon this call no db access happens` Completely **false** statement. The data is fully fetched but hydrated row-by-row. See the code in @BlocksByLukas's answer. – Samuel Katz Feb 02 '20 at 03:15
  • thank you @SalmanvonAbbas for being so scrupulous to my post. From what I remember from 5 yrs ago when I wrote it, the problem was `getIterator() / iterate()` method within ORM classes (couldn't quickly find which file it was). Which had lazy implementation. Whereas the answer of @BlocksByLukas you've referred to shows implementation not of ORM, but of the underlying driver of mysqli `namespace Doctrine\DBAL\Driver\Mysqli\MysqliStatement` (and it uses SQL not DQL). `MysqliStatement` is part of Doctrine project (as a driver but not ORM). That's what I can remember off the top of my head... – Dimitry K Feb 04 '20 at 11:07
  • @SalmanvonAbbas also the point i was trying to make with call to `->iterate()` was that it simply returns wrapper object (there's no actual physical request to mysql or mysql driver made yet). And only upon _request of first item from iterator_ that process of query is initiated. (Thus this behaviour doesn't contradict with what @BlocksByLukas stated, assuming his research results were right and apply here). ps. also in my previous comment i mistakenly mentioned two methods `getIterator()/ iterate()`, but it should correctly read only one `iterate()` (can't edit that comment anymore). – Dimitry K Feb 04 '20 at 11:16
  • 'detach' is deprecated. – Anton Duzenko Oct 31 '22 at 11:52
  • @AntonDuzenko so I wonder what's the new approach? From ([Doctrine Docs about Batch processing](https://www.doctrine-project.org/projects/doctrine-orm/en/latest/reference/batch-processing.html)) we can read the following `An ORM tool is not primarily well-suited for mass inserts, updates or deletions. Every RDBMS has its own, most effective way of dealing with such operations......` seems that the answer is : "Do not use Doctrine for batch processing" :S You can also find this [Doctrine Issue on Deprecation of EntityRepository#clear()](https://github.com/doctrine/orm/issues/7922) helpful – Dimitry K Nov 06 '22 at 14:52
  • At least we do want to use it for batch select (many thousands of nested entities). EntityManager::clear() seems to do the job BUT it still wastes a lot of CPU time on this cache nonsense. I'm generally displeased with Doctrine's internal sloppiness. @DimitryK If you want a job to be done properly, do it yourself ;) – Anton Duzenko Nov 07 '22 at 07:58

If you combine Doctrine's iterate() with a batching strategy, you should be able to iterate over large record sets.

For example:


$batchSize = 1000;
$numberOfRecordsPerPage = 5000;

// get the total number of records to iterate over
$totalRecords = $queryBuilder->select('count(i.id)')
    ->from('Acme\Entities\Image', 'i')
    ->getQuery()
    ->getSingleScalarResult();

$totalProcessed = 0;

while (true) {
    $query = $entityManager->createQuery('SELECT i FROM Acme\Entities\Image i')
        ->setMaxResults($numberOfRecordsPerPage) // maximum records to fetch at a time
        ->setFirstResult($totalProcessed);

    $iterableResult = $query->iterate();

    while (($row = $iterableResult->next()) !== false) {
        $image = $row[0];
        $image->updateSomethingImportant();

        if (($totalProcessed % $batchSize) === 0) {
            $entityManager->flush();
            $entityManager->clear();
        }
        $totalProcessed++;
    }

    if ($totalProcessed === $totalRecords) {
        break;
    }
}

$entityManager->flush();


See Iterating large data in doctrine 2

Samuel James

I strongly believe that batch processing with Doctrine, or any kind of iteration with MySQL (PDO or mysqli), is just an illusion.

@dimitri-k provided a nice explanation, especially about the unit of work. The problem is the misleading $query->iterate(), which doesn't really iterate over the data source. It's just a \Traversable wrapper around an already fully fetched data source.

An example demonstrating that, even with the Doctrine abstraction layer removed completely from the picture, we still run into memory issues:

echo 'Starting with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

$pdo  = new \PDO("mysql:dbname=DBNAME;host=HOST", "USER", "PW");
$stmt = $pdo->prepare('SELECT * FROM my_big_table LIMIT 100000');
$stmt->execute();

while ($rawCampaign = $stmt->fetch()) {
    // echo $rawCampaign['id'] . "\n";
}

echo 'Ending with memory usage: ' . memory_get_usage(true) / 1024 / 1024 . " MB \n";

Output:

Starting with memory usage: 6 MB 
Ending with memory usage: 109.46875 MB
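For comparison, here is a sketch of the same loop with client-side buffering turned off (MYSQL_ATTR_USE_BUFFERED_QUERY is the standard PDO MySQL attribute; note that an unbuffered connection cannot run other queries until the result set is fully consumed):

$pdo = new \PDO("mysql:dbname=DBNAME;host=HOST", "USER", "PW");
// stream rows from the server instead of copying the whole result set
// into client memory up front
$pdo->setAttribute(\PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);

$stmt = $pdo->prepare('SELECT * FROM my_big_table LIMIT 100000');
$stmt->execute();

while ($rawCampaign = $stmt->fetch()) {
    // peak memory now stays roughly constant across the loop
}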

Here, the disappointing getIterator() method:

// in class Doctrine\DBAL\Driver\Mysqli\MysqliStatement

/**
 * {@inheritdoc}
 */
public function getIterator()
{
    $data = $this->fetchAll();

    return new \ArrayIterator($data);
}

You can use my little library to actually stream heavy tables using PHP Doctrine (DQL) or just pure SQL, however you find appropriate: https://github.com/EnchanterIO/remote-collection-stream

Lukas Lukac
  • you didn't explain why your library is better tho – Feb 21 '18 at 11:27
  • @EnchanterIO How does your library improve the issue? https://stackoverflow.com/a/36201665 – Enrico Stahn Dec 20 '18 at 02:47
  • Thank you @blocksbylukas for the interesting idea, but I don't have time now to understand it properly, as I don't use Doctrine that much anymore. – Dimitry K Feb 04 '20 at 11:13
  • I think you are wrong regarding batch processing and Doctrine, at least nowadays. It is not simply an array iterator over already fetched data. I was just able to reduce the memory consumption a lot by implementing the iterator pattern as outlined by others and in the docs. – Richard Kiefer Feb 18 '21 at 16:58
  • But what kind of queries happened behind the scenes? How much data did you fetch in memory? Surely Doctrine is doing some fantastic tricks and I don't remember exactly what was the issue 3 years ago but I think I wrote this answer because I was under the premise that the Iterator is going to somehow "stream" the rows from a table, and I believe it instead fetched all the rows and THEN did PHP optimization on Entities. It's probably different now, 3 years later. That little library (didn't use it since then honestly) is just very explicit about what it does: SQL Offsets + Iterator for looping. – Lukas Lukac Feb 18 '21 at 20:16
  • Oh I miss PHP! Feeling nostalgic now, this was one of the last things I did in PHP. – Lukas Lukac Feb 18 '21 at 20:18

tl;dr:

When running the command, use --no-debug or set the SQL logger to null to prevent Doctrine from saving every query it runs.

Every now and then, call EntityManager::clear(); the memory leak will go down to almost zero.
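A minimal sketch of both tips (setSQLLogger(null) is the Doctrine DBAL configuration call; --no-debug is the usual Symfony console flag):

// stop the DBAL configuration from accumulating every executed query in memory
$entityManager->getConnection()
    ->getConfiguration()
    ->setSQLLogger(null);

// ... and periodically, inside your processing loop:
$entityManager->clear();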

max4ever

For batching I like to use yield this way:

ImageRepository.php

<?php

class ImageRepository extends ServiceEntityRepository
{
    // The rest of your code...

    public function findAllImages(): \Iterator
    {
        // Find the total number of images.
        $total_images = $this->createQueryBuilder('i')
            ->select('COUNT(i.id)')
            ->getQuery()
            ->getSingleScalarResult();

        $processed_records = 0;

        while (true) {
            $results = $this->createQueryBuilder('i')
                // Process a batch of 100 results; this can
                // be whatever amount you can fit in memory.
                ->setMaxResults(100)
                ->setFirstResult($processed_records)
                ->getQuery()
                ->getResult();

            $processed_records += count($results);

            // With yield you are not storing the full
            // amount of images in memory.
            yield from $results;

            if ($processed_records >= $total_images) {
                break;
            }
        }
    }
}

Then you can use it somewhere:

<?php

class ImageCommand extends Command
{
    // The rest of your code...

    public function processAllImages(ImageRepository $repository): void
    {
        $images = $repository->findAllImages();

        foreach ($images as $image) {
            $this->processImage($image);
        }
    }
}
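One caveat: the entities yielded by the generator are still tracked in the EntityManager's identity map (see the discussion above), so for very long runs you may want to clear periodically. A sketch, assuming the command has the EntityManager available as $this->em:

$processed = 0;
foreach ($images as $image) {
    $this->processImage($image);

    if ((++$processed % 100) === 0) {
        // drop references so already-processed batches can be GC'd
        $this->em->clear();
    }
}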
Abraham

The results could be similar because the DB client may be allocating additional memory you can't see. Also, your code already uses the IterableResult returned from $query->iterate(), which should allow for processing large results without memory problems. Just quick thoughts, hope it helped a little.