3

If I were to implement a garbage collected interpreted language in C, how can I go about providing precise (i.e. not conservative) garbage collection without writing my own garbage collector? Are there libraries available for this? If so, which ones? I understand that I would have to maintain certain invariants on my end for any objects tracked by the garbage collector.

Matt
  • 21,026
  • 18
  • 63
  • 115
  • Based on your implementation, count references to dynamic objects. Once they reach zero - discard. – Eugene Sh. Feb 09 '15 at 16:28
  • You might like to take a look at some examples, like the Lua reference-impementation. It uses a tracing GC. – Deduplicator Feb 09 '15 at 16:29
  • @EugeneSh. That doesn't work for circular references, for example. I'm looking for more of a conventional mark and sweep GC. – Matt Feb 09 '15 at 16:32
  • I don't think you will get any good answer, since it is totally based on the interpreter you are implementing. Maybe using an existing one and modifying only the syntax part will help. – Eugene Sh. Feb 09 '15 at 16:34
  • @EugeneSh. Are you saying that every interpreter implements its own GC? I was really hoping for some kind of portable library. – Matt Feb 09 '15 at 16:37
  • Well ,since you are implementing your own interpreter, you might want to use some kind of virtual memory model, or even virtual machine to run the code. Well, this is what I would do in this case. So this is totally up to you how you manage the memory. – Eugene Sh. Feb 09 '15 at 16:41
  • @EugeneSh. I'm totally ok with something like that, as long as the virtual machine can be run from C, and not the other way around. For example, can C start a JVM instance and allocate memory through it? – Matt Feb 09 '15 at 16:47
  • @Matt, no I was talking about specific virtual machine implemented to run the custom language. – Eugene Sh. Feb 09 '15 at 16:48
  • @EugeneSh. Ok, but is there a library that would do that for me? I don't want to have to implement my own VM. – Matt Feb 09 '15 at 16:53
  • Why do you dislike conservative GCs like Boehm's GC? – Basile Starynkevitch Apr 23 '15 at 15:51
  • Is your implementation free software (e.g. on github)? If yes, give the link – Basile Starynkevitch Apr 23 '15 at 15:55
  • @BasileStarynkevitch: I want to provide a guarantee that there will not be any memory leaks (except of course those caused by client code). So far this is just a personal project and I have not plans to distribute it (yet). – Matt Apr 23 '15 at 17:13

3 Answers3

5

If you want a precise GC (not a conservative one, like Boehm's GC, which performs quite well in practice) you should track local pointer (to GC-ed data) variables, or else invoke the GC only with a nearly empty call stack when you are sure there are no such local variables (btw, the GCC compiler has such a mark&sweep garbage collector - with marking routines generated by some specialized gengtype C++ code generator; that GGC is invoked only between passes). Of course you should also track global (including static, or thread local) pointer (to GC-ed data) variables also.

Alternatively, have some bytecode virtual machine (like OCaml or NekoVM have), then the local GC-ed variables are those in the stacks and/or registers of your bytecode VM, and you trigger your GC at specific and carefully chosen points of your VM interpreter. (See this explanation of Ocaml GC).

You should read more about Garbage Collection techniques, see the GC handbook.

If your GC is copying generational, you need to implement write barriers (to handle mutation of old data pointing to new zone). You could use my old Qish GC (which I don't maintain much anymore), or Ravenbrook's MPS, or write your own generational copying GC (this is not that hard in theory, but debugging GCs is a nightmare in practice, so it is a lot of work).

You probably want to use some macro tricks (like my Qish does) to help keeping your local variables. See the Living in harmony with the garbage collector section of Ocaml documentation as an example (or look inside Qish).

Notice that a generational copying GC is not friendly to handle in manually written C code (because you need to explicitly keep local pointers, and because you need a write barrier to remember when an old value is modified to have a pointer to the new generation). If you want to do that, your C code should be in A-normal form (you cannot code x=f(g(y),z); but you need to code temp=g(y); x=f(temp,z); and add temp as a local variable, assuming that x, y, z are local GC-ed variables and that both f and g return a GC-ed pointer). In practice it is much easier to generate the C code. See my MELT domain specific language (to extend and customize GCC) as an example.

If your language is genuinely multithreaded (several mutator threads allocating in parallel), then coding a GC becomes quite tricky. It might take several months of work (and is probably a nightmare to debug).

Actually, I would today recommend using Boehm's GC (notice that it is multithread friendly). A naive mark&sweep handcoded GC would probably not be faster than Boehm's GC. And you won't be able (and I don't recommend) to use GGC, the garbage collector internal to GCC (which, IMNSHO, is not very good; it was a dirty hack design many years ago).

BTW, you might consider customizing -e.g. with MELT- the GCC compiler (by adding some application specific __attribute__ or #pragma) to help your GC. With some work, you could generate some of the marking routines, etc. However, that approach might be quite painful (I really don't know). Notice that MELT (free software, GPLv3+) contains a copying generational GC whose old generation is the GGC heap, so you could at least look inside the code of melt-runtime.cc

PS. I also recommend Queinnec's book: Lisp In Small Pieces; it has some interesting material about GC and their connection to programming languages, and it is a great book to read when you are implementing an interpreter. Scott's book on Programming Languages Pragmatics is also worth reading.

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
  • Wow, I didn't know GCC came with a GC (and it's called GGC... too many acronyms!). That seems like it's exactly the sort of thing I was looking for, thanks! – Matt Apr 23 '15 at 12:56
  • The GGC is only internal to the GCC compiler. It does not matter to you (except as an example of GC, which I don't like much because I feel that GGC is not a very good GC implementation) – Basile Starynkevitch Apr 23 '15 at 13:12
  • @Oh, that's disappointing. But it seems like it may be possible to use it if I link with the GCC source code. I guess borrowing an existing implementation from an open source project is probably the best way to get what I want. But thanks for all the other tips, too! – Matt Apr 23 '15 at 14:58
  • 1
    Look into Ravenbrook's MPS. It probably is very suited to your needs – Basile Starynkevitch Apr 23 '15 at 15:55
4

For C programs, there are 2 options: the Boehm GC which replaces malloc (it is a conservative GC, so perhaps not exactly what you're looking for but it's either that or...), or write your own.

But writing your own isn't that hard. Do the mark-sweep algorithm. The root set for marking will be your symbol table. And you'll need another table or linked-list to track all of the allocated memory that can be freed. When you sweep through the list of allocations, free anything without a mark.

The actual coding will of course be more complicated because you have to iterate through these two kinds of data structures, but the algorithm itself is very simple. You can do it.


A few years ago, I found myself on the same search and these were (and AFAIK still are) the results. Writing your own is tremendously rewarding and worthwhile.

In practice, a great many other issues will arise as Basile's answer touches upon.

If the garbage collector is called from deep in the call-stack (by an allocation routine that needs more memory, perhaps), then care must be taken about any allocations whose handles are still held in local variables of C functions in the call stack, and not saved-out to their symbol-table or database locations. In my postscript interpreter I dealt with this by using a temporary stack which all allocators pushed to. This stack was cleared by the main loop after all subroutines had returned, and it was considered part of the root set during marking. In my APL interpreter, I call the GC everytime around the mainloop. For the little programs in little languages, the speed issues are less vital than the more-dreaded memory leakage, at least among the circles which have influenced me.

luser droog
  • 18,988
  • 3
  • 53
  • 105
1

When implementing such a language, your interpreter needs to keep track of all objects in the program it's running, including knowledge of their types and what part of the data is a reference to other data. Then it's trivial for you to walk all the data and implement whatever sort of garbage collector you like. No bogus hacks like trying to determine where the C implementation's "heap"/"stack"/etc. are located or guessing at what might be a pointer is needed, because you're dealing exactly with the data whose structure you know.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • Obviously I can take the time and implement my own less-than-mediocre GC. I want to know whether there are existing solutions to this problem. And of course I know that I will have to provide some sort of information about where the pointers are. – Matt Feb 09 '15 at 19:00
  • As far as I know, asking for library suggestions is off-topic on SO, but I'm not aware of anything you can just plug in anyway. – R.. GitHub STOP HELPING ICE Feb 10 '15 at 04:59