30

I was thinking about ways that functional languages could be tied more directly to their hardware and was wondering about any hardware implementations of garbage collection.

This would speed things up significantly as the hardware itself would implicitly handle all collection, rather than the runtime of some environment.

Is this what LISP Machines did? Has there been any further research into this idea? Is this too domain specific?

Thoughts? Objections? Discuss.

Nicholas Mancuso
  • the garbage truck that comes by my house picks up and empties the trash can by itself. Does that count? – kenwarner Aug 15 '09 at 03:30
  • I wonder if this would enable whole-of-os GC - imagine having no process partitioning (and by extension wastage from fragmentation, and distributed reserved empty space for new alloc). Install time validation of software, and the ability to pass objects between applications without copying. I like enablers. The hardware assistance could include statistical measurements ongoing finding cold blocks of memory, as well as reference counting. – Kind Contributor Oct 19 '13 at 01:39
  • There seem to be some new developments in the area of Hardware Accelerated GC: https://people.eecs.berkeley.edu/~krste/papers/maas-isca18-hwgc.pdf – janekb04 Dec 16 '20 at 10:40

11 Answers

16

Because of Generational Collection, I'd have to say that tracing and copying are not huge bottlenecks to GC.

What would help is hardware-assisted read barriers, which take away the need for 'stop the world' pauses when scanning stacks and marking the heap.

Azul Systems has done this: http://www.azulsystems.com/products/compute_appliance.htm They gave a presentation at JavaOne on how their hardware modifications allowed for completely pauseless GC.

Another improvement would be hardware assisted write barriers for keeping track of remembered sets.

Generational GCs, and even more so G1 (Garbage-First), reduce the amount of heap they have to scan by scanning only one partition at a time and keeping remembered sets for cross-partition pointers.

The problem is this means ANY time the mutator sets a pointer it also has to put an entry in the appropriate remembered set. So you have (small) overhead even when you're not GCing. If you could reduce this, you'd reduce both the pauses necessary for GCing and improve overall program performance.
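A hardware write barrier would do this bookkeeping as a side effect of the store itself. In software it looks roughly like the following C sketch, using a card table as a coarse remembered set (sizes and names here are illustrative, not taken from any particular VM):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy heap of pointer-sized slots, plus a card table: one dirty byte
   per group of slots. Sizes are arbitrary for illustration. */
#define NUM_SLOTS      8192
#define SLOTS_PER_CARD 64

static void   *heap_slots[NUM_SLOTS];
static uint8_t card_table[NUM_SLOTS / SLOTS_PER_CARD];

/* Software write barrier: every pointer store also dirties the card
   covering the written slot, so the collector can later scan only
   dirty cards for cross-partition pointers instead of the whole heap.
   The extra store here is the per-mutation overhead described above;
   hardware assistance would make it free. */
static void write_ref(size_t slot, void *value) {
    heap_slots[slot] = value;
    card_table[slot / SLOTS_PER_CARD] = 1;
}
```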

Vincent
  • I was wrong, Garbage First scans/marks the whole heap during a GC but only copies (or doesn't copy) sections at a time. – Vincent Mar 16 '09 at 23:48
  • Why stop at garbage collection? It could help optimize unmanaged memory as well: allocating and deallocating it. – ATL_DEV Aug 04 '16 at 14:03
5

One obvious solution was to have memory pointers that are wider than your available RAM requires, for example, 34-bit pointers on a 32-bit machine. Or use the uppermost 8 bits of a 32-bit pointer when you have only 16 MB of RAM (2^24 bytes). The Oberon machines at the ETH Zurich used such a scheme with a lot of success until RAM became too cheap. That was around 1994, so the idea is quite old.

This gives you a couple of bits where you can store object state (like "this is a new object" and "I just touched this object"). When doing the GC, prefer objects with "this is new" and avoid "just touched".

This might actually see a renaissance because no one has 2^64 bytes of RAM (= 2^67 bits; there are about 10^80 ≈ 2^266 atoms in the universe, so it might not be possible to have that much RAM ever). This means you could use a couple of bits in today's machines if the VM can tell the OS how to map the memory.
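On today's machines the spare bits are usually taken from the bottom of the pointer rather than the top: with 8-byte-aligned allocations the low 3 bits are always zero and can hold exactly this kind of object state. A minimal C sketch (the tag names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* With 8-byte-aligned allocations the low 3 bits of a pointer are
   always zero, so they can carry GC state like the "this is a new
   object" / "I just touched this object" flags described above. */
#define TAG_MASK    ((uintptr_t)0x7)
#define TAG_NEW     ((uintptr_t)0x1)   /* freshly allocated */
#define TAG_TOUCHED ((uintptr_t)0x2)   /* recently accessed */

static void *tag_ptr(void *p, uintptr_t tag) {
    return (void *)(((uintptr_t)p & ~TAG_MASK) | tag);
}

static uintptr_t get_tag(void *p) {
    return (uintptr_t)p & TAG_MASK;
}

/* The tag must be stripped before the pointer is dereferenced. */
static void *strip_tag(void *p) {
    return (void *)((uintptr_t)p & ~TAG_MASK);
}
```

The catch, as with the high-bit schemes, is that every dereference must mask the tag off first, which is exactly where hardware support would pay for itself.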

Aaron Digulla
  • I seem to remember the original Apple Macs did something similar: 32-bit pointers, but the top 8 bits were some sort of tag and only 24 bits addressed memory. Some things were allowed to be relocated to avoid fragmentation. They had to get rid of it by the time >16 MB machines appeared! – timday Feb 12 '09 at 21:11
  • Actually, there are about 10^80 atoms in the universe, according to Wikipedia. – Matthew Olenik Mar 13 '09 at 01:22
  • If your memory is 4-byte aligned and your pointers only ever point to 4-byte-aligned regions (you could probably get away with this in high-level languages like Java, where there are only references), then you could use the low 2 bits of each pointer for such data. – tumtumtum Mar 03 '13 at 14:56
  • @tumtumtum: For performance reasons, a lot of systems align data to 8 bytes, so you'd have 3 bits. But you'd need a way to differentiate between a pointer to a structure and a pointer to a character in an array :-/ – Aaron Digulla Mar 04 '13 at 10:10
5

Yes. Look at the related-work sections of these two papers:

https://research.microsoft.com/en-us/um/people/simonpj/papers/parallel-gc/index.htm http://www.filpizlo.com/papers/pizlo-ismm2007-stopless.pdf

Or at this one:

http://researcher.watson.ibm.com/researcher/files/us-bacon/Bacon12StallFree.pdf

kram1032
ja.
4

You're a grad student; this sounds like a good topic for a research grant to me. Look into FPGA design and computer architecture; there are plenty of free processor designs available on http://www.opencores.org/

Garbage collection can be implemented as a background task, it's already intended for parallel operation.

Pete

NoMoreZealots
  • FPGAs are great. There are new Raspberry Pi clones with built-in FPGA units that can be programmed. – ATL_DEV Aug 04 '16 at 14:04
4

There was an article on lambda the ultimate describing how you need a GC-aware virtual memory manager to have a really efficient GC, and VM mapping is done mostly by hardware these days. Here you are :)

Nemanja Trifunovic
2

I'm pretty sure that some prototypes exist. But developing and producing hardware-specific features is very expensive. It took a very long time to implement the MMU or TLB at the hardware level, and those are quite easy to implement.

GC is too big to be efficiently implemented at the hardware level.

Jérôme
  • Things like OpenGL and virtualization are arguably too big to be implemented in hardware, too, but you can add *some* hardware support which makes it possible for these to go really fast. He did say "hardware *assisted*", not "hardware-only". – Ken Feb 12 '09 at 21:01
  • I'd suggest that if one takes a soft core and butchers the MMU, hardware-assisted GC would be possible – Tim Williscroft Mar 13 '09 at 00:41
  • The argument that you can't implement the exact same algorithm is absurd. It is the same as saying "you can't implement a machine which can implement that algorithm". In this case such a machine already exists, because a CPU is such a machine; it's just not optimized for that particular task. A counter is one of the simplest constructs that can be implemented in logic. The rest is a combination of caching, virtual memory, and possibly DMA. – NoMoreZealots Aug 15 '09 at 03:06
2

Older SPARC systems had tagged memory (33 bits) which was usable to mark addresses. Not fitted today?

This came from their LISP heritage IIRC.

One of my friends built a generational GC that tagged by stealing a bit from primitives. It worked better.

So, yes, it can be done, but nobody bothers tagging things anymore.

runT1mes' comments about hardware assisted generational GC are worth reading.

Given a sufficiently big gate array (Virtex-III springs to mind), an enhanced MMU that supported GC activities would be possible.

Tim Williscroft
1

Several great minds at MIT back in the '80s created the SCHEME-79 chip, which directly interpreted a dialect of Scheme, was designed with LISP DSLs, and had hardware GC built in.

Scheme-79 die

Philip Conrad
1

Probably the most relevant piece of data needed here is, how much time (percentage of CPU time) is presently being spent on garbage collection? It may be a non-problem.

If we do go after this, we have to consider that the hardware is fooling with memory "behind our back". I guess this would be "another thread", in modern parlance. Some memory might be unavailable if it were being examined (maybe solvable with dual-port memory), and certainly if the HWGC is going to move stuff around, then it would have to lock out the other processes so it didn't interfere with them. And do it in a way that fits into the architecture and language(s) in use. So, yeah, domain specific.

Look what just appeared... in another post... Looking at java's GC log.

gbarry
  • "if the HWGC is going to move stuff around" - well you won't "move", but rather copy, aborting if the source memory is written to during the "copy" operation. – Kind Contributor Oct 19 '13 at 01:35
1

I gather the biggest problem is CPU registers and the stack. One of the things you have to do during GC is traverse all the pointers in your system, which means knowing what those pointers are. If one of those pointers is currently in a CPU register then you have to traverse that as well. Similarly if you have a pointer on the stack. So every stack frame has to have some sort of map saying what is a pointer and what isn't, and before you do any GC traversing you have to get any pointers out into memory.

You also run into problems with closures and continuations, because suddenly your stack stops being a simple LIFO structure.

The obvious way is to never hold pointers on the CPU stack or in registers. Instead you have each stack frame as an object pointing to its predecessor. But that kills performance.
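The stack-map idea can be sketched in C: each frame carries a bitmask recording which of its slots hold pointers, so the collector can enumerate roots without guessing. The layout below is hypothetical; real runtimes emit a map per call site rather than storing one in the frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SLOTS 8

/* A frame that describes itself to the collector: pointer_mask says
   which slots are heap references and which are plain data. */
struct frame {
    struct frame *caller;   /* link to the previous frame */
    uint8_t pointer_mask;   /* bit i set => slots[i] is a pointer */
    void *slots[MAX_SLOTS];
};

/* Walk the frame chain and count pointer roots -- the traversal that
   needs to know "what is a pointer and what isn't". */
static int count_roots(struct frame *top) {
    int n = 0;
    for (struct frame *f = top; f != NULL; f = f->caller)
        for (int i = 0; i < MAX_SLOTS; i++)
            if (f->pointer_mask & (1u << i))
                n++;
    return n;
}
```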

Paul Johnson
  • Well, it goes without saying that GC would have to be integrated into the processor and work in combination with the MMU and cache system. – NoMoreZealots Aug 15 '09 at 03:30
0

Why would it "speed things up"? What exactly should the hardware be doing? It still has to traverse all references between objects, which means running through a big chunk of data in main memory, which is the main reason it takes time. What would you gain by this? Which particular operation do you think could be done faster with hardware support? Most of the work in a garbage collector is just following pointers/references, and occasionally copying live objects from one heap to another. How is this different from the instructions a CPU already supports?
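To make that concrete, a tracing collector's inner loop is essentially the toy mark phase below: a chain of dependent pointer loads through main memory, which stays memory-bound no matter what executes it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 2

/* A toy heap object: a mark bit plus references to other objects. */
struct obj {
    int marked;
    struct obj *children[MAX_CHILDREN];
};

/* Depth-first mark phase. Nearly all the work is following pointers;
   each step depends on the load before it, so it is bound by memory
   latency rather than by anything a richer instruction set could fix. */
static void mark(struct obj *root) {
    if (root == NULL || root->marked)
        return;
    root->marked = 1;
    for (int i = 0; i < MAX_CHILDREN; i++)
        mark(root->children[i]);
}
```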

With that said, why do you think we need faster garbage collection? A modern GC can already be very fast and efficient, to the point where it's basically a solved problem.

jalf
  • what 'tude? I simply asked for more details on what you thought it'd achieve (and why it would be beneficial). Sorry if it sounded harsh. I simply don't see what it is you want to hardware-accelerate exactly. – jalf Feb 12 '09 at 19:44
  • But when you say that "this would speed things up significantly", I think it's fair game to ask *why*. Don't you? – jalf Feb 12 '09 at 19:46
  • I was under the impression that when things are implemented in a hardware version, it tends to outpace the software version. Not in all cases, as I have seen documentation of virtualized functions running faster than their hardware counterparts. You just came off as snobby, dude. – Nicholas Mancuso Feb 13 '09 at 00:32
  • Just tell me why you think it's going to be slower. I'll stop because I don't want to cause a flame war or something. But other than _speed_, perhaps give some reasons why it shouldn't be done. Because it seems there are gobs of papers that have done prototypes of these. – Nicholas Mancuso Feb 13 '09 at 00:33
  • IBM wouldn't be pouring millions of dollars into Metronome, and Sun wouldn't be scrapping the ConcLowPause collector for Garbage-First, if it were a 'solved problem'. Sure, it's solved on commodity hardware, but once you talk about 64-gig heaps, multicore, and reducing pause times, we have a long way to go. – Vincent Feb 13 '09 at 19:57