20

I have a "why does it work that way?" question about garbage collection (any/all implementations: Java, Python, CLR, etc.). Garbage collectors deallocate an object when it is no longer in any scope; the number of references pointing to it is zero. It seems to me that a framework could deallocate as soon as the number of references reaches zero, but all implementations I've encountered wait a while and then deallocate many objects at a time. My question is, why?

I'm assuming that the framework keeps an integer for each object (which I think Python does, because you have to call Py_INCREF and Py_DECREF when writing extension modules for it in C; presumably these functions modify a real counter somewhere). If so, then it shouldn't take any more CPU time to eliminate the object the moment it goes out of scope. If it takes x nanoseconds per object now, then it would take x nanoseconds per object later, right?

If my assumption is wrong and there is no integer associated with each object, then I understand why garbage collection waits: it would have to walk the graph of references to determine the status of each object, and that calculation takes time. Such a method would consume less memory than the explicit reference-count method, but I'm astonished that it's quicker or is the preferred method for other reasons. It sounds like a lot of work.

From a programming point of view, it would be nice if objects deallocated immediately after they go out of scope. Not only could we rely on destructors being executed when we want them to be (one of the Python gotchas is that __del__ is not called at a predictable time), but it would become much easier to memory-profile a program. Here's an example of how much confusion this causes. To my mind, the benefits of programming in a deallocate-right-away framework are so great that there must be some good reason why all the implementations I've heard of wait before deallocating. What is that benefit?

Note: if the walk over the graph of references is only needed to identify circular references (a pure reference count can't), then why not a hybrid approach? Deallocate objects as soon as their reference count hits zero and then also do periodic sweeps to look for circular references. Programmers working in such a framework would have a performance/determinism reason to stick to non-circular references as much as is feasible. It's often feasible (e.g. all data are in the form of JSON objects with no pointers to parents). Is this how any popular garbage collectors work?
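To make that concrete, here is the kind of behavior I mean, which (as far as I can tell) is observable in CPython today, using weakref to watch when an object actually dies:

```python
import gc
import weakref

class Node:
    pass

# Acyclic object: the refcount hits zero at `del`, so it dies immediately.
a = Node()
watcher = weakref.ref(a)
del a
print(watcher() is None)  # True: deallocated the moment the count hit zero

# Cyclic object: the refcount never reaches zero, so only the periodic
# cycle-detecting sweep can reclaim it.
b = Node()
b.self_ref = b             # a reference cycle of length one
watcher2 = weakref.ref(b)
del b
print(watcher2() is None)  # False: unreachable but still alive
gc.collect()               # the backup sweep finds the cycle
print(watcher2() is None)  # True
```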

Jim Pivarski

7 Answers

35

To start with, a point of terminology: "garbage collection" means different things to different people, and some GC schemes are more sophisticated than others. Some people consider reference counting to be a form of GC, but personally I consider "true GC" to be distinct from reference counting.

With refcounts, there is an integer tracking the number of references, and you can trigger deallocation immediately when the refcount hits zero. This is how the CPython implementation works, and how most varieties of C++ smart pointers work. The CPython implementation adds a mark/sweep GC as a backup, so it's very much like the hybrid design you describe.

But refcounting is actually a pretty terrible solution, since it incurs a (relatively) expensive memory write (plus a memory barrier and/or lock, to ensure thread safety) every time a reference is passed, which happens quite a lot. In imperative languages like C++ it's possible (just difficult) to manage memory ownership through macros and coding conventions, but in functional languages like Lisp it becomes well-nigh impossible, because memory allocation usually happens implicitly due to local variable capture in a closure.

So it should come as no surprise that the first step toward a modern GC was invented for Lisp. It was called the "twospace allocator" or "twospace collector" and it worked exactly like it sounds: it divided allocatable memory (the "heap") into two spaces. Every new object was allocated out of the first space until it got too full, at which point allocation would stop and the runtime would walk the reference graph and copy only the live (still referenced) objects to the second space. After the live objects were copied, the first space would be marked empty, and allocation would resume, allocating new objects from the second space, until it got too full, at which point the live objects would be copied back to the first space and the process would start all over again.
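A toy sketch of the copying step, in Python with illustrative names (a real collector copies raw bytes and leaves forwarding pointers behind, not Python objects), shows the key property: work is proportional to the live objects, and garbage is never even visited.

```python
# Toy semispace ("twospace") copying collection, Cheney-style.

class Obj:
    def __init__(self, name, refs=()):
        self.name = name
        self.refs = list(refs)

def collect(roots):
    """Copy everything reachable from roots into to_space."""
    to_space = []   # the new heap
    forward = {}    # id(old object) -> its copy in to_space

    def copy(obj):
        if id(obj) not in forward:
            clone = Obj(obj.name, obj.refs)   # shallow copy into to_space
            forward[id(obj)] = clone
            to_space.append(clone)
        return forward[id(obj)]

    new_roots = [copy(r) for r in roots]
    scan = 0
    while scan < len(to_space):               # breadth-first scan of the copies
        obj = to_space[scan]
        obj.refs = [copy(r) for r in obj.refs]
        scan += 1
    return new_roots, to_space

# A -> B are live; C is garbage and is never touched by collect().
b = Obj("B")
a = Obj("A", [b])
c = Obj("C")
roots, new_heap = collect([a])
print([o.name for o in new_heap])  # ['A', 'B']
```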

The advantage of the twospace collector is that, instead of doing O(N) work, where N is the total number of garbage objects, it would only do O(M) work, where M is the number of objects that were not garbage. Since in practice, most objects are allocated and then deallocated in a short period of time, this can lead to a substantial performance improvement.

Additionally, the twospace collector made it possible to simplify the allocator side as well. Most malloc() implementations maintain what is called a "free list": a list of which blocks are still available to be allocated. To allocate a new object, malloc() must scan the free list looking for an empty space that's big enough. But the twospace allocator didn't bother with that: it just allocated objects in each space like a stack, by just pushing a pointer up by the desired number of bytes.

So the twospace collector was much faster than malloc(), which was great because Lisp programs would do a lot more allocations than C programs would. Or, to put it another way, Lisp programs needed a way to allocate memory like a stack but with a lifetime that was not limited to the execution stack -- in other words, a stack that could grow infinitely without the program running out of memory. And, in fact, Raymond Chen argues that that's exactly how people should think about GC. I highly recommend his series of blog posts starting with Everybody thinks about garbage collection the wrong way.

But the twospace collector had a major flaw, which is that no program could ever use more than half the available RAM: the other half was always wasted. So the history of GC techniques is the history of attempts to improve on the twospace collector, usually by using heuristics of program behavior. However, GC algorithms inevitably involve tradeoffs, usually preferring to deallocate objects in batches instead of individually, which inevitably leads to delays where objects aren't deallocated immediately.

Edit: To answer your follow-up question, modern GCs generally incorporate the idea of generational garbage collection, where objects are grouped into different "generations" based on lifetime, and an object in one generation gets "promoted" to another generation once it's lived long enough. Sometimes a small difference in object lifetime (e.g. in a request-driven server, storing an object for longer than one request) can lead to a large difference in the amount of time it takes before the object gets deallocated, since it causes it to become more "tenured".

You correctly observe that a true GC has to operate "beneath" the level of malloc() and free(). (As a side note, it's worth learning about how malloc() and free() are implemented -- they aren't magic either!) Additionally, for an effective GC, you either need to be conservative (like the Boehm GC) and never move objects, and check things that might be pointers, or else you need some kind of "opaque pointer" type -- which Java and C# call "references". Opaque pointers are actually great for an allocation system, since it means you can always move objects by updating pointers to them. In a language like C where you interact directly with raw memory addresses, it's never really safe to move objects.
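To illustrate why opaque pointers help, here is a toy handle table in Python (all names hypothetical): the program only ever holds integer handles, so the collector can move an object by rewriting a single table entry and every handle stays valid.

```python
# Toy "opaque pointer" scheme: handles indirect through a runtime-owned table.
class HandleTable:
    def __init__(self):
        self._slots = []                 # handle -> current storage location

    def new(self, obj):
        self._slots.append(obj)
        return len(self._slots) - 1      # the program only ever sees this int

    def deref(self, handle):
        return self._slots[handle]

    def relocate(self, handle, new_copy):
        self._slots[handle] = new_copy   # one write "moves" the object

rt = HandleTable()
h = rt.new(["payload"])
rt.relocate(h, ["payload"])              # e.g. copied into the other semispace
print(rt.deref(h))  # ['payload']
```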

And there are multiple options for GC algorithms. The standard Java runtime contains no less than five collectors (Young, Serial, old CMS, new CMS, and G1, although I think I'm forgetting one) and each has a set of options that are all configurable.

However, GCs aren't magic. Most GCs are just exploiting the time-space tradeoff of batching work, which means that the gains in speed are usually paid for in increased memory usage (compared to manual memory management or refcounting). But the combination of increased program performance and increased programmer performance, versus the low cost of RAM these days, makes the tradeoff usually worth it.

Hopefully that helps make things clearer!

Daniel Pryden
  • This makes a lot of sense and answers my questions about efficiency. (Marking one of the two whole spaces empty, from beginning to end, would take x nanoseconds, rather than x*N nanoseconds.) It seems that my intuition was formed by CPython, which is where most of my experience is, but there are a lot of better schemes than CPython's. At least the two space collector is an existence proof of a better way, though I take it from your explanation that modern GCs like Java's and CLR's (and web browsers'?) do something more complicated to avoid the problem of losing half the heap space. – Jim Pivarski Jul 15 '13 at 11:14
  • To implement this, one needs more low-level access than `malloc` and `free`, which is not something I was expecting. I had thought that these functions were the bottom-level. – Jim Pivarski Jul 15 '13 at 11:27
  • @JimPivarski: I've expanded my answer to address your comments -- does that make more sense? – Daniel Pryden Jul 16 '13 at 03:29
  • @DanielPryden: Yes, that is helpful, thanks! However, with all this clever work going on to maintain the illusion that the heap space is infinite, it becomes hard for application-level programmers to profile their memory use or provide hints when we know we're going to allocate something big (like GB arrays for number crunching). I'll read more--- I may end up looking for a way to tap into a specific GC's metadata for diagnostics someday. – Jim Pivarski Jul 16 '13 at 04:55
  • @JimPivarski: Yes, there are people who argue that the productivity gains of using a GC are lost due to the time it takes to profile and tune the performance of your application as a result. I wouldn't go quite that far, but it certainly adds a lot of complexity in surprising places. Fortunately, Stack Overflow is a great place to get help with this: I'd recommend you ask more specific questions about what you're trying to profile, since that will help you get better answers. – Daniel Pryden Jul 16 '13 at 07:13
  • @DanielPryden: I've often wished for a language/framework which could combine reference counting and GC, making a distinction between "owning" and "non-owning" references. Operations on "owning" reference would require adjusting reference counts, but would only be necessary in cases where the recipient of a reference could expect to become the last thing that needs it. Most references could be passed without overhead. When the number of "owning" references hits zero, the object would be notified, but it would continue to *exist* as long as any references exist. – supercat Jul 16 '13 at 16:57
  • @DanielPryden: In the majority of situations involving resources that require timely release, each resource will always have one clearly-defined owner, but it may be useful to share things like cached immutable bitmaps, keep them as GDI objects as long as any reference exists, but disposing them as soon as the last reference is gone. Reference-counting would be the best way to manage such resources, but it isn't safe as a "primary" means of memory management. – supercat Jul 16 '13 at 17:00
  • @DanielPryden: Generational GC is a key to achieving good performance. It's worthwhile to note that achieving really good GC performance requires being able to identify older-generation objects which do not contain any references that have been written since the last collection of the older generation (meaning that they can't possibly contain any references to newer objects). On many GC's this requires being able to trap an attempt to store a new reference in an older object which is flagged as not containing newer-object references and clear that flag. – supercat Jul 17 '13 at 14:42
  • @supercat: Good point. Thanks for adding your comments, both here and in your answer, since they add a lot to the discussion. The idea of "owning" vs "non-owning" references is an interesting one, I'll have to think about that some more. Also note that another important technique that is synergistic with GC is escape analysis -- the best way to collect garbage is not to create it in the first place! – Daniel Pryden Jul 17 '13 at 15:56
  • @DanielPryden: I wish .NET had a more kinds of references and byrefs (e.g. ephemeral, returnable, access-limited, etc.). I'm not sure escape detection would help GC performance, but it could eliminate a lot of unnecessary defensive copying. For example, in a fluent interface, `PointMaker.Make.WithX(4).WithY(3).Create();`, having `WithY` can modify the object upon which it is called rather than creating a new object could improve performance, and would be safe iff no other use is made of that object, but there's no way to ensure that would be the case. – supercat Jul 17 '13 at 16:09
  • I was just reading a bit about the first garbage collector and was thinking one could easily reduce the memory waste of a two-space collector to 1/(N+1) if one was willing to have collections require up to N passes. If one has 8K of memory and 2K of freespace at the bottom of the free store, grab all live objects that fit entirely in the 2K immediately above the free space and move them to the free space, then move the object (if any) that straddled the upper boundary of that free space. If one ends up with 2.5K of free space, then grab all objects that fit entirely in the 2.5K above that. – supercat Apr 02 '15 at 22:59
  • If a two-space GC gives acceptable performance when live objects total 1/3 of total RAM (e.g. one GC pass is required for each 1/6 of RAM worth of total allocations), having the GC only reserve a third of the memory would mean that two GC passes would be required for each 1/3 RAM worth of allocation, but each pass would each have less work to do. Any idea when such improvements would have been implemented? – supercat Apr 02 '15 at 23:06
  • @supercat: I don't know of any collectors that work like you describe, but it's fairly common in a generational GC to have multiple younggen or scavenger spaces. See also [Java 7's G1 collector](http://docs.oracle.com/javase/7/docs/technotes/guides/vm/G1.html), which while not quite the same, has a similar concept. – Daniel Pryden Apr 02 '15 at 23:51
  • @DanielPryden: Newer collectors are more complicated; such complexity is necessary for good performance when managing many megs or gigs, but would be counterproductive something like an embedded system with under 64K of RAM). I was thinking of approaches for combining the simplicity of the two-space collector with better performance and higher memory utilization. MS-derived BASIC interpreters can use 100% of the string pool, but their GC performance is horrible. I would think that on something like a small ARM, an approach like I described could be quite effective... – supercat Apr 03 '15 at 15:47
  • ...especially if combined with an "Eden" generation. If the main pool fills bottom-up with Eden above it (separated by an Eden-sized gap), and the Eden collection counts the total allocated space of everything, then the system could know how much memory would be freed by running a main collection to completion, and after each step of the main collection the code could decide whether additional steps were worthwhile. – supercat Apr 03 '15 at 15:53
9

To understand garbage-collection, go to a bowling alley and watch how the pinsetter removes fallen pins after the first ball has been rolled. Rather than identifying and removing individual fallen pins, the pinsetter mechanism picks up all the pins that are still standing, lifts them to safety, and then runs a sweeper bar across the lane without regard for how many pins are lying there or where they are located. Once that is done, the pins that were standing are placed back on the lane. Many garbage-collection systems work on much the same principle: they have to do a non-trivial amount of work for each live object to ensure it doesn't get destroyed, but dead objects are destroyed wholesale without even being looked at or noticed.

Addendum

A garbage collector that always has to act on every live item to ensure its preservation is apt to be slow when there are a lot of live items; this is why garbage collectors have, historically, gotten a bad rap. The BASIC interpreter on the Commodore 64 (which was, incidentally, written by Microsoft in the days before MS-DOS) would take many seconds to perform a garbage collection in a program which had an array of a few hundred strings. Performance can be improved enormously if items which survive their first garbage collection are ignored until many other items have survived their first, and items which have participated in and survived two garbage collections (note that they won't have to participate in their second until many other objects have survived their first) are ignored until many other objects have also survived their second. This concept can be partially implemented easily (even on the Commodore 64, one could force all strings that existed at a given moment to be exempt from future garbage collection, which could be useful if on startup a program created large arrays of strings that would never change) but becomes more powerful with a little extra hardware support.

If one figures that a garbage collector will try to pack the objects which are going to be kept as close to one end of memory as it can, generational support requires doing nothing more than keeping track of what (contiguous) range of memory is used by objects of each generation. All objects of every generation must be scanned to make sure all newer-generation live objects are located and preserved, but older-generation objects don't have to be moved, since the memory they occupy isn't in danger of wholesale elimination. This approach is very simple to implement, and can offer some significant performance improvements versus a non-generational GC, but even the scanning phase of a GC can be expensive if there are many live objects.

The key to speeding up "newer-generation" garbage collections is to observe that if an object "Fred" has not been written since the last garbage collection in which it participated, it cannot possibly contain any references to any objects which have been created since that time. Consequently, none of the objects to which it holds references would be in any danger of elimination until Fred itself is eligible for elimination. Of course, if references to newer objects have been stored in Fred since the last lower-level GC, those references do need to be scanned. To accomplish this, advanced garbage collectors set up hardware traps which fire when parts of the older-generation heap are written. When such a trap fires, it adds the objects in that region to a list of older-generation objects which will need to be scanned, and then disables the trap associated with that region. In cases where older-generation objects frequently have references to newer objects stored in them, this extra bookkeeping can hurt performance, but in most cases it ends up being a major performance win.
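Here is a toy model of that bookkeeping (Python, illustrative names only; real collectors use hardware page traps or card tables rather than a Python set):

```python
# Write barrier feeding a "remembered set": a minor collection scans only the
# young objects plus those old objects written since the last collection.
class OldObject:
    def __init__(self):
        self.fields = {}

class Heap:
    def __init__(self):
        self.remembered = set()        # old objects that may reference young ones

    def write(self, obj, field, value, value_is_young):
        obj.fields[field] = value
        if value_is_young:
            self.remembered.add(id(obj))   # the "trap" fires: flag for rescanning

    def minor_collect(self, young_roots, old_objects):
        # Scan young roots plus remembered old objects; untouched old objects
        # (like Fred before any write) are skipped entirely.
        scanned = list(young_roots) + [o for o in old_objects
                                       if id(o) in self.remembered]
        self.remembered.clear()            # re-arm the barrier
        return scanned

heap = Heap()
fred, barney = OldObject(), OldObject()
heap.write(fred, "child", "young-obj", value_is_young=True)
scanned = heap.minor_collect([], [fred, barney])
print(fred in scanned, barney in scanned)  # True False
```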

supercat
6

Your thoughts are generally very insightful and well considered. You're just missing some basic information.

Garbage collectors deallocate an object when it is no longer in any scope

That is completely incorrect in general. Garbage collectors work at run-time on a representation in which the notion of scope has long since been removed. For example, inlining and applications of liveness analysis destroy scope.

Tracing garbage collectors recycle space at some point after the last reference disappears. Liveness analysis can have references in the stack frame overwritten with other references even if the variable is still in scope because liveness analysis determined that the variable is never used again and, therefore, is no longer needed.

It seems to me that a framework could deallocate as soon as the number of references reaches zero, but all implementations I've encountered wait a while and then deallocate many objects at a time. My question is, why?

Performance. You can reference count at the level of stack entries and registers but performance is absolutely terrible. All practical reference counting garbage collectors defer counter decrements to the end of scope in order to achieve reasonable (but still bad) performance. State-of-the-art reference counting garbage collectors defer decrements in order to batch them up and can allegedly attain competitive performance.

I'm assuming that the framework keeps an integer for each object

Not necessarily. For example, OCaml uses a single bit.

From a programming point of view, it would be nice if objects deallocated immediately after they go out of scope.

From a programming point of view, it would be nice if code ran 10x faster effortlessly.

Note that destructors inhibit tail call elimination, which is invaluable in functional programming.

I'm astonished that it's quicker or is the preferred method for other reasons. It sounds like a lot of work.

Consider a program that solves the n-queens problem by manipulating lists of chess board coordinates. The input is a single integer. The output is a list containing a few board coordinates. The intermediate data is a huge spaghetti stack of linked list nodes. If you coded this up by pre-allocating a big enough stack of linked list nodes, manipulating them to get the answer, copy out the (small) answer and then calling free once on the entire stack then you'd be doing almost exactly the same thing that a generational garbage collector does. In particular, you'd only copy ~6% of your data and you'd deallocate the other ~94% with a single call to free.
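A sketch of that manual strategy as an arena (Python, illustrative; a real implementation would carve nodes out of pre-allocated raw memory):

```python
# Arena-style bulk free: build the whole spaghetti stack in one pool, copy out
# the small answer, then release the pool in a single operation -- the manual
# analogue of what a copying generational collector does automatically.
class Arena:
    def __init__(self):
        self._nodes = []

    def cons(self, head, tail):
        node = (head, tail)
        self._nodes.append(node)   # every intermediate cell lands in the pool
        return node

    def release(self):
        n = len(self._nodes)
        self._nodes.clear()        # one "free" for all n cells
        return n

arena = Arena()
lst = None
for coord in [(0, 3), (1, 1), (2, 4), (3, 2)]:   # stand-in intermediate data
    lst = arena.cons(coord, lst)
answer = [lst[0]]                  # copy out the small result we actually keep
freed = arena.release()
print(freed)  # 4
```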

That was a perfect happy-day scenario for a generational garbage collector that adheres to the hypothesis that "most objects die young and old objects rarely refer to new objects". A pathological counter-example where generational garbage collectors struggle is filling a hash table with freshly allocated objects. The spine of the hash table is a big array that survives, so it will be in the old generation. Every new object inserted into it creates a back-pointer from the old generation to the new generation. Every new object survives. So generational garbage collectors allocate quickly but then mark everything, copy everything and update pointers to everything and, therefore, run ~3x slower than a simple C or C++ solution would.

Not only could we rely on destructors being executed when we want them to be (one of the Python gotchas is that __del__ is not called at a predictable time), but it would become much easier to memory-profile a program

Note that destructors and garbage collection are orthogonal concepts. For example, .NET provides destructors in the form of IDisposable.

FWIW, in ~15 years of using garbage collected languages I have used memory profiling maybe 3 times.

why not a hybrid approach? Deallocate objects as soon as their reference count hits zero and then also do periodic sweeps to look for circular references. Programmers working in such a framework would have a performance/determinism reason to stick to non-circular references as much as is feasible. It's often feasible (e.g. all data are in the form of JSON objects with no pointers to parents). Is this how any popular garbage collectors work?

CPython does that, I believe. Mathematica and Erlang restrict the heap to be a DAG by design so they can use reference counting alone. GC researchers have proposed related techniques such as trial-deletion as an auxiliary algorithm to detect cycles.

Note also that reference counting is theoretically asymptotically faster than tracing garbage collection as its performance is independent of the size of the (live) heap. In practice, tracing garbage collection is still much faster even with 100GB heaps.

J D
0

I think the reason is performance. If you create many objects in a loop and destroy them at the end of each loop step, it would take more time to execute that code than waiting until the program is idle and freeing the data all at once, or when memory runs low.

rekire
  • If deallocation takes x nanoseconds per object and you have N objects to deallocate, then the same time is taken whether you do it during the loop or after the loop. Perhaps the argument is that most programs do a lot of work, then wait for resources/user input, then work again. In that case, spending the x*N nanoseconds after the loop is beneficial, since the program would just be waiting. But that doesn't apply to number-crunching applications, and it would be interesting to know that the optimization only applies to run-and-wait type programs. – Jim Pivarski Jul 15 '13 at 04:10
  • @JimPivarski: As I noted (and you saw) in my answer, many GC's can destroy old objects wholesale without having to do *any* work for each one. This can be a big win if the majority of objects are abandoned before their first GC. Even without that ability, GC would still be a win on multi-processor systems because synchronizing actions processors is expensive. In a system with stop-the-world GC, memory-management-related synchronization is only required once every GC cycle, rather than once for every allocation and deallocation (note that many multi-processor systems allocate... – supercat Jul 17 '13 at 14:47
  • ...a separate "first generation" heap for each processor. This doesn't require much extra storage, since the first generation heap is often less than 5% of the total memory space, but it means that a processor can allocate memory out of its own heap without any other processors having to know or care about its doing so. Since multiprocessor synchronization is expensive *and is, relatively speaking, getting more expensive every year*, that's a big win. – supercat Jul 17 '13 at 14:49
0

Where I've come across GC systems they wait until they need to run, so that the relocation of objects still in use can be done once, rather than many times.

Consider a series of objects allocated sequentially in memory:

Object 1
Object 2
Object 3
Object 4
Object 5

If Object 2 can be deallocated, and GC operates immediately, Objects 3,4 and 5 will all need to be moved.

Now suppose Object 4 can be deallocated: GC will move Object 5 next to Object 3, so Object 5 has now been moved twice.

However, if GC waits a short while, both Objects 2 and 4 can be removed at the same time, meaning that Object 5 is moved once, and moved further.

Multiply the number of objects by, say, 100 and you can see considerable time savings from this approach.

  • Why are the objects moving? If they're declared on the heap, rather than the stack, it isn't a fundamental problem if there are gaps between them--- the surviving objects can stay where they are. A fragmented heap does become a problem (you run out of contiguous spots to put new big objects) and I can understand why a heap-defragmenter should wait and only run periodically. Is a garbage collector also a defragmenter? And if so, is its role as defragmenter more significant than its role as deallocator? – Jim Pivarski Jul 15 '13 at 04:16
  • Suppose a loop creates and eliminates objects that are all the same size (which is a typical case). It would be useful for them to deallocate right away, *without* re-location, since the gaps left between them can be filled up with the next loop iteration's objects. That allows the loop to run without growing in memory use. A defragmenter would want to clean up the survivors' locations later, however. I see good reason to delay the defragmentation, but not the deallocation. – Jim Pivarski Jul 15 '13 at 04:21
  • No - it's not a problem if there are gaps between objects on a heap. As you rightly point out, it's the lack of contiguous gaps that becomes a problem. If GC runs immediately on deallocation, there's a waste of resource in moving repeatedly. If it waits until there's no room left at all, GC becomes a time consuming task that impacts on responsiveness. It's a balance. Firefox recently changed their GC algorithm to a more progressive approach because of this latter problem. From what I see now, FF does more GC overall, but with less impact on the response. –  Jul 15 '13 at 04:27
  • To clarify: I agree that defragmentation should wait for an optimal time, not too soon (lots of wasteful re-locations) and not too late (the mess gets out of hand). I'm asking about why it waits to deallocate. – Jim Pivarski Jul 15 '13 at 04:39
  • @Mike W: The idea of compacting the heap immediately after each deallocation is plain absurd. Imagine you have just freed 10 bytes: would you compact your 10 GB heap? Would you? – maaartinus Oct 17 '13 at 22:40
  • @maaartinus I wouldn't. And that wasn't what I was saying. I don't understand your point. –  Oct 17 '13 at 22:42
  • @Mike W: Then I don't understand what you were saying. You wrote "If Object 2 can be deallocated, and GC operates immediately, Objects 3,4 and 5 will all need to be moved". How else can this be understood? WTF need they? – maaartinus Oct 17 '13 at 23:02
0

@Jim's question has largely been answered already; I will add a bit more to it.

Firstly, what makes you think that deallocating as soon as the count reaches zero [A1] is a good alternative?

Garbage collectors not only deallocate objects but are responsible for complete memory management, starting with fragmentation, one of the biggest issues for garbage collectors. Handled badly, fragmentation results in unnecessary page hits and cache misses; garbage collectors are designed from the start to handle this issue, and with different generations it becomes easier still. With [A1], a thread would have to be set up to do this compaction periodically on its own.

Moreover, it turns out that clearing many objects at once is faster than [A1]. (Think of a room with sand scattered across the floor: it is faster to sweep it all up together than to pick up each grain individually.)

Secondly, for thread safety in multi-threaded systems, one would have to take a lock (or use an atomic operation) on every object to increment/decrement the count, which costs performance and extra memory. Moreover, modern collectors can work in parallel rather than stopping the world (e.g. Java's parallel GC); I wonder how that could be done with [A1].

Jatin
0

Garbage collection using reference counting is very slow, especially in a threaded environment.

I really recommend this post by Brian Harry.

A code sample is provided there which was more than enough to convince me (C#):

public interface IRefCounted : IDisposable
{
        void AddRef();
}

// ref counted base class.
class RefCountable : IRefCountable
{
        private int m_ref;
        public RefCountable()
        {
                m_ref = 1;
        }
        public void AddRef()
        {
                Interlocked.Increment(ref m_ref);
        }
        public void Dispose()
        {
                if (Interlocked.Decrement(ref m_ref) == 0)
                        OnFinalDispose();
        }
        protected virtual void OnFinalDispose()
        {
        }
}

Interlocked.Increment(ref m_ref) is an atomic operation that takes hundreds of memory cycles.

ytoledano