I have a large series of raster datasets representing monthly rainfall over several decades. I've written a script in Python that loops over each raster and does the following:
- Converts the raster to a numpy masked array,
- Performs lots of array algebra to calculate a new water level,
- Writes the result to an output raster,
- Repeats.
The script is essentially just a long list of array algebra equations inside a loop (a stripped-down sketch is below).
Everything works fine if I run the script on a small subset of my data (say 20 years' worth), but if I try to process the whole lot I get a MemoryError. The error doesn't give any more information than that, except that it points to the line of code at which Python gave up.
Unfortunately, I can't easily process my data in chunks: at the end of each iteration the output (water level) is fed back into the next iteration as its starting point, so I really need to be able to do the whole lot in one run.
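To make the structure concrete, here is a heavily stripped-down sketch of the loop. The file names, coefficients and the GDAL read/write helpers are just placeholders for illustration; the real script has many more equations between the read and the write.

```python
import numpy as np
from osgeo import gdal


def read_masked(path):
    """Read band 1 as a numpy masked array (assumes a nodata value is set)."""
    ds = gdal.Open(path)
    band = ds.GetRasterBand(1)
    return np.ma.masked_equal(band.ReadAsArray(), band.GetNoDataValue()), ds


def write_raster(path, array, template):
    """Write a masked array to a GeoTIFF, copying georeferencing from a template dataset."""
    driver = gdal.GetDriverByName("GTiff")
    out = driver.Create(path, template.RasterXSize, template.RasterYSize,
                        1, gdal.GDT_Float32)
    out.SetGeoTransform(template.GetGeoTransform())
    out.SetProjection(template.GetProjection())
    out.GetRasterBand(1).WriteArray(array.filled(-9999))
    out.GetRasterBand(1).SetNoDataValue(-9999)
    out.FlushCache()


# Placeholder inputs -- the real script builds these lists from decades of monthly files.
rainfall_paths = ["rain_1980_01.tif", "rain_1980_02.tif"]   # ...hundreds more
water_level, template = read_masked("initial_water_level.tif")

for path in rainfall_paths:
    rainfall, _ = read_masked(path)

    # ...long list of array algebra, heavily simplified here...
    recharge = rainfall * 0.25                    # placeholder coefficient
    water_level = water_level + recharge - 0.1    # placeholder losses

    write_raster(path.replace("rain", "level"), water_level, template)
    # water_level is carried straight into the next iteration as its starting point
```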
My understanding of programming is still fairly basic, but I thought all of my objects would simply be overwritten on each iteration. I (stupidly?) assumed that if the code managed to loop successfully once, it should be able to loop indefinitely without using more and more memory.
I've tried reading various bits of documentation and have discovered something called the "Garbage Collector", but I feel like I'm getting out of my depth and my brain's melting! Can anyone offer some basic insight into what actually happens to the objects in memory as my code loops? Is there a way of freeing up memory at the end of each iteration, or is there a more "Pythonic" way of writing the code that avoids this problem altogether?