597

I wrote a Python program that acts on a large input file to create a few million objects representing triangles. The algorithm is:

  1. read an input file
  2. process the file and create a list of triangles, represented by their vertices
  3. output the vertices in the OFF format: a list of vertices followed by a list of triangles. The triangles are represented by indices into the list of vertices

The requirement of OFF that I print out the complete list of vertices before I print out the triangles means that I have to hold the list of triangles in memory before I write the output to the file. In the meantime, I'm getting memory errors because of the sizes of the lists.
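
To illustrate, a minimal OFF file is laid out roughly like this (a header with the vertex/face/edge counts, then every vertex, then each triangle as indices into the vertex list):

OFF
4 2 0
0.0 0.0 0.0
1.0 0.0 0.0
0.0 1.0 0.0
0.0 0.0 1.0
3 0 1 2
3 0 1 3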

What is the best way to tell Python that I no longer need some of the data, and it can be freed?

Russia Must Remove Putin
Nathan Fellman
  • 14
    Why not print out the triangles to an intermediate file, and read them back in again when you need them? – Alice Purcell Aug 22 '09 at 21:03
  • 4
    This question could potentially be about two quite different things. Are those errors *from the same Python process*, in which case we care about freeing memory to the Python process's heap, or are they from different processes on the system, in which case we care about freeing memory to the OS? – Charles Duffy Nov 20 '18 at 13:16
  • Can't you simply iterate over the file object using a generator and collect active vertices in a list you clear when done with that group? yield file_object.reads in an infinite loop and break on empty data. Or if line-based, just iterate over the object? Should be lazy. – GRAYgoose124 Aug 12 '23 at 05:25

10 Answers

784

According to the official Python documentation, you can explicitly invoke the garbage collector to release unreferenced memory with gc.collect(). Example:

import gc

gc.collect()

You should do that after dropping the references to whatever you want to discard, using del:

del my_array
del my_object
gc.collect()
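
In a loop that repeatedly builds large temporary structures, the pattern usually looks something like this (a sketch only; load_chunk and process are hypothetical placeholders for your own code):

import gc

for chunk_path in chunk_paths:     # process the input in pieces
    data = load_chunk(chunk_path)  # hypothetical memory-hungry step
    process(data)                  # hypothetical
    del data                       # drop the only reference to the big object
    gc.collect()                   # then ask the collector to reclaim it (and any cycles) right away
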
Havenard
  • 32
    Things are garbage collected frequently anyway, except in some unusual cases, so I don't think that will help much. – Lennart Regebro Aug 22 '09 at 19:31
  • 9
    Maybe not frequently enough. He could be creating these "million objects" in a matter of milliseconds. – Havenard Aug 22 '09 at 19:36
  • 2
    Yeah, but the garbage collector is being run every 700th time you create an object, so it will have run thousands of times during those milliseconds. – Lennart Regebro Aug 22 '09 at 19:46
  • 37
    In general, gc.collect() is to be avoided. The garbage collector knows how to do its job. That said, if the OP is in a situation where he is suddenly deallocating a *lot* of objects (like in the millions), gc.collect may prove useful. – Jason Baker Aug 22 '09 at 19:53
  • 243
    Actually calling `gc.collect()` yourself at the end of a loop can help avoid fragmenting memory, which in turn helps keep performance up. I've seen this make a significant difference (~20% runtime IIRC) – RobM Feb 22 '11 at 18:54
  • 101
    I'm using python 3.6. Calling `gc.collect()` after loading a pandas dataframe from hdf5 (500k rows) reduced memory usage from 1.7GB to 500MB – John Jan 18 '18 at 20:30
  • 69
    I need to load and process several numpy arrays of 25GB in a system with 32GB memory. Using `del my_array` followed by `gc.collect()` after processing the array is the only way the memory is actually released and my process survives to load the next array. – David Oct 12 '18 at 08:10
  • 4
    @David, you'd have the exact same behavior calling `gc.collect()` without the `del` so long as `my_array` goes out-of-scope or has a different value or object assigned to still-in-scope names beforehand. And if you changed your code to no longer have reference loops, the memory would (in CPython) be immediately reclaimed even without the `gc.collect()`. – Charles Duffy Nov 19 '18 at 04:25
  • If you have a service up and running, memory usage will increase over time and `gc.collect()` will not work in most parts of your code. Finally, I realized that calling it in the outermost function, on the last line, just before my service exits, works. Another option is having a thread that wakes up and cleans up from time to time by calling gc.collect(). – Meghna Natraj Mar 12 '19 at 15:48
  • @RobM Can you explain more about the idea of using `gc.collect()` at the end of a loop? I am doing some analysis using huge numpy arrays in a for loop. Would `gc.collect()` at the end of each loop help? – seralouk Jul 22 '19 at 11:50
  • @serafeim Not only at the end of a loop, but really at the end of any code block. Memory allocations that go out of scope should be released as soon as they are no longer needed; this gives your memory management behavior similar to what you would get from languages like C++, avoids bloating your virtual memory with garbage, and improves the performance of the memory manager itself, as dynamically allocated data becomes less fragmented. – Havenard Jul 23 '19 at 01:12
  • I've observed that calling gc.collect() before each timing test removed the order-dependency of the test results. – user2183078 Aug 25 '19 at 20:05
  • If you ever run into an issue where a shi**y lib isn't closing file handles, so attempts to delete a file fail, it might help to delete the object and explicitly call `gc.collect()`. – omni Mar 26 '20 at 16:09
  • what worked for me was `del variable` and then `gc.collect()` – busssard Jun 15 '20 at 11:44
  • 1
    In line with RobM's tip, there was formerly an "even riskier" optimization for _extremely_ tight loops, such as template generation: [**disable the GIL**](https://docs.python.org/3/library/sys.html#sys.setcheckinterval). Avoiding freeing the lock every 100 opcodes for thread election saved a fair amount of overhead, between 10% and 15% in some of my template tests. But, of course, has the downside that suddenly: every other thread will become starved during the operation optimized this way. (A fun experiment, not production-worthy IMO, and hard-deprecated as the GIL was rewritten.) – amcgregor Aug 14 '20 at 06:08
  • When loading a GeoTIFF through the GDAL API, `del variable` followed by `gc.collect()` was the effective way to release memory. – Sheece Gardazi Mar 16 '21 at 13:16
  • if I have a list of file pointers that are opened, do I 1) need to del the entire list or 2) each element in the list one at a time and then call `gc.collect()`? – Charlie Parker Mar 31 '21 at 16:03
  • This doesn't work for pandas data frames...facing an issue... – Fr_nkenstien Apr 07 '21 at 11:25
  • 1
    In my case, I was running a django server with a huge data transfer (100 MB) on each call, and the garbage collector did not do the job: my server was crashing because memory was full. The baseline was 100 MB of space required to run django, and it could increase to 1.5 GB. With my micro-instances, this was too much. Now it is working fine, just garbage collecting at the start of my API call. – Nicolas Julémont Oct 14 '21 at 12:47
  • just used this to get past an Out of Memory issue where a script was reading & writing multi-gigabyte files in a loop. Garbage collector couldn't keep up, I guess. `del data` and then `gc.collect()` right after the write. – jpx Nov 24 '21 at 20:43
  • Calling the collect function after every `del array` worked great for a memory issue I was having with processing large files – knw500 Aug 04 '22 at 21:54
  • Did not help for releasing the GDAL allocated numpy arrays. – arapEST Dec 19 '22 at 20:37
135

Unfortunately (depending on your version and release of Python) some types of objects use "free lists" which are a neat local optimization but may cause memory fragmentation, specifically by making more and more memory "earmarked" for only objects of a certain type and thereby unavailable to the "general fund".

The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.

In your use case, it seems that the best way for the subprocesses to accumulate some results and yet ensure those results are available to the main process is to use semi-temporary files (by semi-temporary I mean, NOT the kind of files that automatically go away when closed, just ordinary files that you explicitly delete when you're all done with them).
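
A minimal sketch of that pattern, assuming a hypothetical build_triangles() helper that does the memory-hungry work (the file names here are illustrative too):

import multiprocessing

def worker(input_path, scratch_path):
    # all the memory-hungry work happens inside this process; when it
    # terminates, the OS reclaims everything it allocated
    triangles = build_triangles(input_path)      # hypothetical helper
    with open(scratch_path, "w") as f:
        for a, b, c in triangles:
            f.write("%d %d %d\n" % (a, b, c))

if __name__ == "__main__":
    scratch = "triangles.tmp"                    # the semi-temporary file
    p = multiprocessing.Process(target=worker, args=("input.dat", scratch))
    p.start()
    p.join()                                     # subprocess exits, its memory goes back to the OS
    with open(scratch) as f:                     # the main process reads the results back
        for line in f:
            pass                                 # ... assemble the final OFF output here
    # delete the scratch file when you're all done with it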

tshepang
Alex Martelli
  • 46
    I sure would like to see a trivial example of this. – Russia Must Remove Putin Nov 25 '13 at 20:26
  • 21
    @AaronHall Trivial example [now available](http://stackoverflow.com/a/24126616/1600898), using `multiprocessing.Manager` rather than files to implement shared state. – user4815162342 Jun 11 '14 at 14:54
  • 1
    if I have a list of file pointers that are opened, do I 1) need to del the entire list or 2) each element in the list one at a time and then call `gc.collect()`? – Charlie Parker Mar 31 '21 at 16:03
  • @CharlieParker Let's say the list is `x = [obj1, obj2, ...obj20]`. To release the memory, any of the following measures will do: (1) `del x` (2) `x=[]` (3) `del x[:]`. Note that with method (1), the variable `x` is deleted and no longer accessible, so the memory for the list `x` will also be released, while with methods (2) and (3), `x` is still accessible and still consumes memory. – AnnieFromTaiwan Sep 24 '21 at 09:25
  • @CharlieParker Last but not least, `gc.collect()` is not necessary, because the contents of the list are objects and the memory of an object is released immediately as soon as its ref count drops to 0. You only need `gc.collect()` if the contents of the list are integers or floats. – AnnieFromTaiwan Sep 24 '21 at 09:46
73

The del statement might be of use, but IIRC it isn't guaranteed to free the memory. The docs are here ... and an explanation of why the memory isn't released is here.

I have heard of people on Linux and Unix-type systems forking a Python process to do some work, getting the results and then killing it.

This article has notes on the Python garbage collector, but I think the lack of memory control is the downside of managed memory.
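
A rough sketch of that fork-and-discard idea, using multiprocessing so it also works where os.fork isn't available (summarize() and the input names are hypothetical):

from multiprocessing import Pool

def crunch(path):
    # the memory-hungry work runs in a worker process and dies with it
    return summarize(path)          # hypothetical helper returning a small result

if __name__ == "__main__":
    inputs = ["part1.dat", "part2.dat", "part3.dat"]       # illustrative
    with Pool(processes=1, maxtasksperchild=1) as pool:    # a fresh worker per task
        results = pool.map(crunch, inputs)                 # small results come back pickled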

legoscia
Aiden Bell
  • Would IronPython and Jython be another option to avoid this problem? – Esteban Küber Aug 22 '09 at 19:23
  • @voyager: No, it wouldn't. And neither would any other language, really. The problem is that he reads large amounts of data into a list, and the data is too large for the memory. – Lennart Regebro Aug 22 '09 at 19:27
  • 1
    It would likely be *worse* under IronPython or Jython. In those environments, you're not even guaranteed the memory will be released if nothing else is holding a reference. – Jason Baker Aug 22 '09 at 19:33
  • @voyager, yes, because the Java virtual machine looks globally for memory to free. To the JVM, Jython is nothing special. On the other hand, the JVM has its own share of drawbacks, for instance that you must declare in advance how large a heap it can use. – Prof. Falken May 08 '13 at 07:41
  • It's a rather awful implementation of the Python garbage collector. Visual Basic 6 and VBA also have managed memory, but no one ever complained about memory not being freed there. – Anatoly Alekseev Oct 24 '20 at 01:34
  • if I have a list of file pointers that are opened, do I 1) need to del the entire list or 2) each element in the list one at a time and then call `gc.collect()`? – Charlie Parker Mar 31 '21 at 16:03
42

Python is garbage-collected, so if you reduce the size of your list, it will reclaim memory. You can also use the "del" statement to get rid of a variable completely:

biglist = [blah,blah,blah]
#...
del biglist
Ned Batchelder
  • 33
    This is and isn't true. While decreasing the size of the list allows the memory to be reclaimed, there is no guarantee when this will happen. – user142350 Aug 22 '09 at 19:39
  • 4
    No, but usually it will help. However, as I understand the question here, the problem is that he has to have so many objects that he runs out of memory before processing them all, if he reads them into a list. Deleting the list before he is done processing is unlikely to be a useful solution. ;) – Lennart Regebro Aug 22 '09 at 19:48
  • 4
    Also note that del doesn't guarantee that an object will be deleted. If there are other references to the object, it won't be freed. – Jason Baker Aug 22 '09 at 19:55
  • 3
    Wouldn't a low-memory/out-of-memory condition trigger an "emergency run" of the garbage collector? – Jeremy Friesner Aug 22 '09 at 23:02
  • 5
    will biglist = [ ] release memory? – neouyghur Jul 24 '16 at 17:55
  • 5
    yes, if the old list isn't referenced by anything else. – Ned Batchelder Jul 24 '16 at 22:09
  • What if I create a linked list and delete the head node using `del`: will the other nodes be freed up or not? Since the 2nd node is no longer referenced by the head node, it should be freed, and then the 3rd node will no longer be referenced by the 2nd node, so it should also get freed up, and so on. Is it like so? – Rohit Kumar Oct 06 '18 at 05:43
  • Yes, in that scenario, all of the nodes will be freed. – Ned Batchelder Oct 06 '18 at 11:37
  • if I have a list of file pointers that are opened, do I 1) need to del the entire list or 2) each element in the list one at a time and then call `gc.collect()`? – Charlie Parker Mar 31 '21 at 16:05
30

(del can be your friend, as it marks objects as being deletable when there are no other references to them. Now, often the CPython interpreter keeps this memory for later use, so your operating system might not see the "freed" memory.)

Maybe you would not run into any memory problem in the first place if you used a more compact structure for your data. For example, lists of numbers are much less memory-efficient than the format used by the standard array module or the third-party numpy module. You would save memory by putting your vertices in a NumPy 3xN array and your triangles in an N-element array.
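
A rough sketch of that layout (sizes are illustrative; the 3xN vertex array suggested above is shown here row-wise as N x 3, which works just as well):

import numpy as np

num_vertices, num_triangles = 1000000, 2000000              # illustrative sizes

# one compact machine-typed row per vertex instead of a list of Python objects
vertices = np.empty((num_vertices, 3), dtype=np.float64)    # x, y, z per row

# one row of three vertex indices per triangle
triangles = np.empty((num_triangles, 3), dtype=np.int32)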

Eric O. Lebigot
  • Eh? CPython's garbage collection is refcounting-based; it's not a periodic mark-and-sweep (as for many common JVM implementations), but instead immediately deletes something the moment its reference count hits zero. Only cycles (where refcounts would be zero but aren't because of loops in the reference tree) require periodic maintenance. `del` doesn't do anything that just reassigning a different value to all names referencing an object wouldn't. – Charles Duffy Nov 19 '18 at 04:22
  • I see where you're coming from: I will update the answer accordingly. I understand that the CPython interpreter actually works in some intermediate way: `del` frees the memory from Python's point of view, but generally not from the C runtime library's or OS' point of view. References: https://stackoverflow.com/a/32167625/4297, http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm. – Eric O. Lebigot Nov 20 '18 at 09:47
  • Agreed as to the content of your links, but assuming the OP is talking about an error they get *from the same Python process*, the distinction between freeing memory to the process-local heap and to the OS doesn't seem likely to be relevant (as freeing to the heap makes that space available for new allocations within that Python process). And for that, `del` is equally effective with exits-from-scope, reassignments, etc. – Charles Duffy Nov 20 '18 at 13:14
27

You can't explicitly free memory. What you need to do is to make sure you don't keep references to objects. They will then be garbage collected, freeing the memory.

In your case, when you need large lists, you typically need to reorganize the code to use generators/iterators instead. That way you don't need to have the large lists in memory at all.
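
A minimal sketch of that idea for this particular pipeline, assuming a hypothetical parse_triangle() helper: generate triangles lazily and stream the face records to a scratch file, so that only the vertex table stays in memory.

def triangles(path):
    # generator: yields one triangle at a time instead of building
    # a multi-million-element list
    with open(path) as f:
        for line in f:
            yield parse_triangle(line)          # hypothetical parser

def write_off(in_path, out_path):
    vertex_index = {}                           # vertex -> index, the only big in-memory structure
    with open("faces.tmp", "w") as faces:       # stream face records to a scratch file
        for tri in triangles(in_path):
            ids = []
            for v in tri:
                if v not in vertex_index:
                    vertex_index[v] = len(vertex_index)
                ids.append(vertex_index[v])
            faces.write("3 %d %d %d\n" % tuple(ids))
    # then write the OFF header and the vertices to out_path, and append faces.tmp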

Lucas
Lennart Regebro
  • 3
    If this approach is feasible, then it's probably worth doing. But it should be noted that you can't do random access on iterators, which may cause problems. – Jason Baker Aug 22 '09 at 20:01
  • 2
    That's true, and if that is necessary, then accessing large datasets randomly is likely to require some sort of database. – Lennart Regebro Aug 22 '09 at 20:22
  • You can easily use an iterator to extract a random subset of another iterator. – S.Lott Aug 22 '09 at 22:33
  • True, but then you would have to iterate through everything to get the subset, which will be very slow. – Lennart Regebro Aug 23 '09 at 06:49
15

I had a similar problem reading a graph from a file. The processing included the computation of a 200,000 x 200,000 float matrix (one line at a time) that did not fit into memory. Trying to free the memory between computations using gc.collect() fixed the memory-related aspect of the problem, but it resulted in performance issues: I don't know why, but even though the amount of used memory remained constant, each new call to gc.collect() took a little more time than the previous one, so quite quickly the garbage collecting took most of the computation time.

To fix both the memory and performance issues, I switched to a multithreading trick I read about somewhere (I'm sorry, I cannot find the related post anymore). Before, I was reading each line of the file in a big for loop, processing it, and running gc.collect() every once in a while to free memory space. Now I call a function that reads and processes a chunk of the file in a new thread. Once the thread ends, the memory is automatically freed, without the strange performance issue.

Practically it works like this:

from dask import delayed  # this module wraps the multithreading
def f(storage, index, chunk_size):  # the processing function
    # read the chunk of size chunk_size starting at index in the file
    # process it using data in storage if needed
    # append data needed for further computations to storage
    return storage

partial_result = delayed([])  # put into delayed() the constructor for your data structure
# I personally use "delayed(nx.Graph())" since I am creating a networkx Graph
chunk_size = 100  # ideally as big as possible while still letting the computations fit in memory
for index in range(0, len(file), chunk_size):
    # tell dask that we will want to apply f to the parameters partial_result, index, chunk_size
    partial_result = delayed(f)(partial_result, index, chunk_size)

    # no computations are done yet!
    # dask will spawn a thread to run f(partial_result, index, chunk_size) once we call partial_result.compute()
    # passing the previous "partial_result" variable as a parameter ensures that a chunk is only processed after the previous one is done
    # it also lets you use the results of the processing of the previous chunks in the file if needed

# this launches all the computations
result = partial_result.compute()

# one thread is spawned for each "delayed", one at a time, to compute its result
# dask then closes the thread, which solves the memory-freeing issue
# the strange performance issue with gc.collect() is also avoided
Retzod
11

Others have posted some ways that you might be able to "coax" the Python interpreter into freeing the memory (or otherwise avoid having memory problems). Chances are you should try their ideas out first. However, I feel it is important to give you a direct answer to your question.

There isn't really any way to directly tell Python to free memory. The fact of the matter is that if you want that low a level of control, you're going to have to write an extension in C or C++.

That said, there are some tools to help with this.

Jason Baker
11

As other answers already say, Python can refrain from releasing memory to the OS even when it's no longer in use by Python code (so gc.collect() doesn't free anything there), especially in a long-running program. In any case, if you're on Linux, you can try to release memory by directly invoking the libc function malloc_trim (man page). Something like:

import ctypes
libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)
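
Note that this typically only helps after Python has already freed the objects on its side (e.g. with del and/or gc.collect()); malloc_trim just asks glibc to hand unused heap pages back to the OS.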
Joril
  • How do I pass a reference to the object I want to delete to the library you suggest? I have the variable names for them; do I do `lib.malloc_trim(var)`? – Charlie Parker Mar 31 '21 at 16:08
  • I'm afraid `malloc_trim` doesn't work that way (see man page). Moreover I think that libc doesn't know anything about Python variable names, so this approach is not suitable for working with variables – Joril Apr 01 '21 at 08:20
4

If you don't care about vertex reuse, you could have two output files--one for vertices and one for triangles. Then append the triangle file to the vertex file when you are done.
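
A rough sketch of that idea (generate_triangles() and the file names are hypothetical; buffered writes keep the per-triangle I/O cost down):

vertex_count = 0
with open("vertices.tmp", "w") as vf, open("faces.tmp", "w") as tf:
    for tri in generate_triangles("input.dat"):      # hypothetical triangle source
        for x, y, z in tri:                          # no vertex reuse: three new vertices per triangle
            vf.write("%g %g %g\n" % (x, y, z))
        tf.write("3 %d %d %d\n" % (vertex_count, vertex_count + 1, vertex_count + 2))
        vertex_count += 3

with open("mesh.off", "w") as out:                   # assemble the final OFF file at the end
    out.write("OFF\n%d %d 0\n" % (vertex_count, vertex_count // 3))
    for name in ("vertices.tmp", "faces.tmp"):
        with open(name) as part:
            for line in part:                        # copy line by line to avoid loading either file whole
                out.write(line)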

Nosredna
  • 2
    I figure I can keep only the vertices in memory and print the triangles out to a file, and then print out the vertices only at the end. However, the act of writing the triangles to a file is a huge performance drain. Is there any way to speed *that* up? – Nathan Fellman Aug 23 '09 at 18:25