
I was reading this generated HTML (the link may expire; here is the original PS file.)

GC Myth 3: Garbage collectors are always slower than explicit memory deallocation.
GC Myth 4: Garbage collectors are always faster than explicit memory deallocation.

This was a big WTF for me. How would GC be faster than explicit memory deallocation? Isn't it essentially calling an explicit memory deallocator when it frees the memory/makes it available for use again? So... wtf... what does it actually mean?

Very small objects & large sparse heaps ==> GC is usually cheaper, especially with threads

I still don't understand it. It's like saying C++ is faster than machine code (if you don't understand the WTF in this sentence, please stop programming. Let the -1 begin). After a quick google, one source suggested it's faster when you have a lot of memory. What I am thinking is that it means it doesn't bother with the free at all.

Sure, that can be fast, and I have written a custom allocator that does that very thing, not freeing at all (`void free(void*p){}`), in ONE application that doesn't free any objects (it only frees at the end when it terminates) and has the definition mostly in case of libs and something like the STL. So I am pretty sure this would be faster than GC as well. If I still want freeing, I guess I can use an allocator that uses a deque, or its own implementation that's essentially

if (freeptr < someaddr) {
    *freeptr = ptr;          /* record this pointer to free later */
    ++freeptr;
}
else
{
    freestuff();             /* batch-free everything recorded so far */
    freeptr = freeptrroot;   /* reset to the start of the buffer */
    *freeptr++ = ptr;        /* keep the pointer that triggered the flush */
}

which I am sure would be really fast. I sort of answered my own question already: the case where the GC collector is never called is the case where it would be faster, but... I am sure that is not what the document means, as it mentions two collectors in its test. I am sure the very same application would be slower if the GC collector is called even once, no matter what GC is used. If it's known that freeing is never needed, then an empty free body can be used, like in that one app I had.

Anyway, I am posting this question for further insight.

Marcos Gonzalez
  • Example fail. C++ can in fact, for everything on a larger scale, be faster than machine code, because C++ compilers are notoriously better than assembly programmers at writing near-ideal machine code when the human has long ago abandoned sanity, and do so very consistently even when generating billions of instructions. And that's assuming you meant runtime performance of a given program, not including whether the program gets finished at all. –  Apr 28 '11 at 10:01
  • @delnan: I didn't mean "writing". Also, in the case of SSE and the like, compilers don't know they can use the special instructions, or where to use them. –  Apr 28 '11 at 18:19
  • You're underestimating compilers. [Several compilers have used them for quite a while](http://en.wikipedia.org/wiki/SSE2#Compiler_usage). Also see http://stackoverflow.com/questions/405770/why-are-compilers-so-stupid, which doesn't quite fit this, but still includes a few examples of GCC generating pretty much optimal code. –  Apr 28 '11 at 19:05
  • VC++ generates lovely SSE code. Just set the option in "properties" and watch it go. – Jive Dadson Oct 04 '12 at 20:34
  • "C++ can in fact, for everything on a larger scale, be faster than machine code, because C++ compilers are notoriously better than assembly programmers at writing near-ideal machine code". I've studied this many times over the years and found time and time again that it is not true: compilers produce quite poor machine code. I studied it again last year and found once more that I (with no x86 expertise) could easily write better assembler than GCC. http://flyingfrogblog.blogspot.co.uk/2012/04/x86-code-generation-quality.html – J D Jan 05 '13 at 23:41

4 Answers


How would GC be faster than explicit memory deallocation?

  1. GCs can pointer-bump allocate into a thread-local generation and then rely upon copying collection to handle the (relatively) uncommon case of evacuating the survivors. Traditional allocators like malloc often compete for global locks and search trees. (A sketch of this point and the next follows this list.)

  2. GCs can deallocate many dead blocks simultaneously by resetting the thread-local allocation buffer instead of calling free on each block in turn, i.e. O(1) instead of O(n).

  3. By compacting old blocks so more of them fit into each cache line. The improved locality increases cache efficiency.

  4. By taking advantage of extra static information such as immutable types.

  5. By taking advantage of extra dynamic information such as the changing topology of the heap via the data recorded by the write barrier.

  6. By making more efficient techniques tractable, e.g. by removing the headache of manual memory management from wait-free algorithms.

  7. By deferring deallocation to a more appropriate time or off-loading it to another core. (thanks to Andrew Hill for this idea!)
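
Here is a minimal sketch of points 1 and 2, assuming a fixed-size thread-local nursery (the `Nursery`, `alloc`, and `reset` names are illustrative, not any particular runtime's API). Allocation is one bounds check and a pointer bump; reclaiming every dead object in the region is a single reset once the survivors have been evacuated:

#include <cstddef>
#include <cstdint>

// Illustrative thread-local nursery: allocation bumps a pointer, and
// "freeing" every dead object in the region is a single reset.
class Nursery {
    static constexpr std::size_t kSize = 1 << 20;       // 1 MiB region
    alignas(std::max_align_t) std::uint8_t buf_[kSize];
    std::size_t top_ = 0;                               // bump offset

public:
    void* alloc(std::size_t n) {
        constexpr std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);                     // round up for alignment
        if (top_ + n > kSize) return nullptr;           // full: a real GC would collect here
        void* p = buf_ + top_;
        top_ += n;                                      // the entire allocation path
        return p;
    }

    // O(1) "deallocation": after survivors are copied out, one reset
    // reclaims everything, instead of one free() call per dead block.
    void reset() { top_ = 0; }
};

By contrast, malloc typically has to find a suitably sized free chunk and may take a lock to do it; here the common path is an add and a compare.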

J D
  • 7. By deferring de-allocation to an optimal time of otherwise minimal CPU contention (at a cost of minor extra RAM usage), and doing so in a different thread from the main one. – Andrew Hill Feb 02 '16 at 05:29
  • This almost reads like: A "very carefully designed and optimized" state-of-the-art GC will beat a "naive" traditional allocator. I would argue that that is not a fair comparison. Which of these 7 are really impossible for a more advanced allocator? For example, an allocator can also keep track of thread-local stores, search trees and try to prevent global locks. – André Jun 28 '17 at 11:07
  • All of 1 to 6 are impossible to do from an allocator alone. – J D Jun 28 '17 at 11:31

One approach to making GC faster than explicit deallocation is to deallocate implicitly:

The heap is divided into partitions, and the VM switches between the partitions from time to time (when a partition gets too full, for example). Live objects are copied to the new partition, and the dead objects are not deallocated; they are just left behind, forgotten. So the deallocation itself ends up costing nothing. An additional benefit of this approach is that heap defragmentation comes as a free bonus.

Please note this is a very general description of the actual processes.
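
As a toy illustration of the idea (assuming the collector is handed the set of live objects; a real VM discovers them by tracing roots and also rewrites the references into the moved objects):

#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Toy partition flip: live objects are copied, packed, into the other
// partition; dead objects are simply abandoned, so deallocating them
// costs nothing, and compaction falls out for free.
struct TwoPartitionHeap {
    std::vector<std::uint8_t> part[2];
    int active = 0;
    std::size_t top = 0;            // next free byte in the active partition

    explicit TwoPartitionHeap(std::size_t bytes)
        : part{std::vector<std::uint8_t>(bytes), std::vector<std::uint8_t>(bytes)} {}

    void flip(const std::vector<std::pair<const void*, std::size_t>>& live) {
        int to = 1 - active;
        std::size_t newTop = 0;
        for (const auto& obj : live) {              // copy the survivors...
            std::memcpy(part[to].data() + newTop, obj.first, obj.second);
            newTop += obj.second;
        }
        active = to;                                // ...and forget the rest
        top = newTop;
    }
};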

kostja
  • I like it. I never thought about a speed bonus due to defragmentation and fitting the memory into fewer cache pages. Nice. +1 –  Apr 28 '11 at 09:00

The trick is that the underlying allocator for a garbage collector can be much simpler than an explicit one and can take shortcuts that the explicit one can't.

  1. If the collector is copying (the Java, .NET, OCaml, and Haskell runtimes, among many others, actually use one), freeing is done in big blocks, allocating is just a pointer increment, and the cost is paid per object *surviving* collection. So it's faster especially when there are many short-lived temporary objects, which is quite common in these languages.
  2. Even for a non-copying collector (like Boehm's), the fact that objects are freed in batches saves a lot of the work of combining adjacent free chunks. So if the collection does not need to run too often, it can easily be faster. (A sketch of such a batched sweep follows this list.)
  3. And, well, many standard library malloc/free implementations just suck. That's why there are projects like umem, and why libraries like glib have their own lightweight version.
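
As a rough sketch of point 2, assume a heap laid out as contiguous blocks, each with a header carrying a size and a mark bit (a hypothetical layout, not what Boehm's collector actually uses). One linear sweep frees every unmarked block and coalesces adjacent dead neighbors as it goes, where explicit free() calls would repeat the neighbor checks once per deallocation:

#include <cstddef>

// Hypothetical block header preceding each allocation.
struct Block {
    std::size_t size;   // total size of the block, header included
    bool marked;        // set by the mark phase if the block is reachable
    bool free;
};

static Block* nextBlock(Block* b) {
    return reinterpret_cast<Block*>(reinterpret_cast<char*>(b) + b->size);
}

// One pass over the whole heap: free the dead, merge adjacent free runs.
void sweep(Block* heap, Block* end) {
    for (Block* b = heap; b < end; ) {
        if (!b->marked) b->free = true;
        b->marked = false;                          // clear for the next cycle
        Block* next = nextBlock(b);
        while (b->free && next < end && !next->marked) {
            b->size += next->size;                  // absorb the dead neighbor
            next = nextBlock(b);
        }
        b = next;
    }
}
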
Jan Hudec
  • +1. A question: how does the pointer increment work? I'm thinking something like `ptr += chunksize` won't work, since there will be many holes in memory and you won't know which is available. So maybe it keeps a list of what memory has moved and changes the ptr each time it finds it. The latter sounds like it might work, but this is a guess. Do you know what a good implementation might do? –  Apr 28 '11 at 09:11
  • @acidzombie: Copying GCs reserve one large chunk and use that. The pointer moves *only* up. It's not decremented when deallocating, specifically to avoid dealing with holes (except perhaps as an optimization when the "top" is deallocated, i.e. no live objects remain between the pointer and the deallocated object). When the current area is exhausted, all live objects are copied into a new area (without holes) and the pointer starts after the last object. (Also see http://blogs.msdn.com/b/abhinaba/archive/2009/02/02/back-to-basics-copying-garbage-collection.aspx) –  Apr 28 '11 at 19:10

A factor not yet mentioned is that when using manual memory allocation, even if object references are guaranteed not to form cycles, determining when the last entity to hold a reference has abandoned it can be expensive, typically requiring the use of reference counters, reference lists, or other means of tracking object usage. Such techniques aren't too bad on single-processor systems, where the cost of an atomic increment may be essentially the same as an ordinary one, but they scale very badly on multi-processor systems, where atomic-increment operations are comparatively expensive.
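
To put that cost in code, here is a minimal intrusive reference count (the `Counted` name is illustrative). Every retain and release is an atomic read-modify-write, and on a multi-processor machine each one forces the cache line holding the counter to migrate to the core doing the update, which is where the poor scaling comes from:

#include <atomic>

// Illustrative intrusive reference count, the classic shared-ownership
// pattern. Each retain/release is an atomic read-modify-write on a
// counter shared by every thread that touches the object.
struct Counted {
    std::atomic<long> refs{1};

    void retain() { refs.fetch_add(1, std::memory_order_relaxed); }

    void release() {
        // acq_rel so the final release synchronizes with the delete below
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;        // the last holder deallocates
    }

    virtual ~Counted() = default;
};

A tracing GC sidesteps these per-reference atomic operations entirely: it determines liveness by scanning, so merely copying a pointer costs nothing.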

supercat