Garbage collection issue in an interpreter implemented in C

Question

I'm working on a hobby compiler/interpreter for a toy procedural language and I've implemented most of the features I set out to explore except for a good garbage collection algorithm (similar to this guy). I've read quite a bit about various algorithms and I have a general idea of how to implement them. Earlier iterations of my language runtime used reference counting but I dropped it to learn something more advanced so I'm now considering a mark and copy compacting algorithm.

My first issue in getting started is preventing the algorithm from collecting 'objects' in native extension functions (i.e. functions written in C). The root set consists of 'objects' on the interpreter's stack and 'objects' in symbol tables, and I shouldn't have too much trouble with these, however, if a container 'object' is created in a C function, then populated with child 'objects', how can I prevent the GC from collecting them since it's not actually on the interpreter stack or bound to a symbol?

Things that make implementing GC easier:

All 'objects' in my language are of a builtin type (e.g. not object oriented)
The interpreter stack is just a stack of pointers to structs
Symbol tables are just arrays of pointers to structs

User code:

f = open('words.txt', 'r');
lines = readlines(f);
close(f);

Interpreter (after parsing, compiling to bytecode...):

push filename, open_mode
call builtin_fopen which returns a struct wrapping a FILE*
store result in symbol f
push symbol f
call builtin_flines which creates a list type l, then used C fread to read each line of the file as a string type, appending it to the list l
store result in symbol lines, and so on....

Now if the GC ran while one of the strings containing a line in the file was being allocated, the root set does not yet have any reference to l, so it should get collected. Any ideas on how to handle this better?

"mark and copy" -- Presumably you mean mark and sweep. All the objects allocated by your interpreter should be in their own heap, separate from that used by extension functions. If you want to GC extension function memory, you're looking at something like the Boehm conservative GC. — Jim Balter, Apr 07 '13 at 21:35
I really meant [mark and compact](http://en.wikipedia.org/wiki/Mark-compact_algorithm) which copies reachable objects. The idea is that the extension functions allocate objects intended to be put on the interpreter stack and/or bound to symbols, and therefore should be garbage collected at a later point. — joe, Apr 07 '13 at 22:34
You might find my question [here](http://stackoverflow.com/questions/6366868/how-to-schedule-collection-cycles-for-custom-mark-sweep-collector) of interest. — luser droog, Apr 08 '13 at 04:14
I think my solution is a combination of Chris Dodd, n.m., and Derek Pressnall's answers. n.m.'s method of keeping special allocations above the top of the heap's top pointer is useful, but doesn't solve the issue of garbage collecting in the middle of the native function's execution. Chris Dodd's method of creating a run-time interface for native functions to obtain handles to objects is great, but Derek Pressnall actually explained how to do so, by passing a reference to the object's pointer to the mutator. This explanation gets my 50 points. — joe, Apr 21 '13 at 16:15

score 3 · Answer 1 · answered Apr 07 '13 at 21:41

Dedicate a separate contiguous allocation arena for the interpreter's heap. Never collect anything outside of the arena.
You always have the arena's current top (assuming it grows from lower to higher addresses). Everything above the top is not collectible but considered in the root set. A builtin function that has to allocate several linked objects allocates them above the top, then moves the top up so that all the allocated objects end up in the collectible heap at once. If the collection happens in the midst of the function execution, objects above the top are moved to the new heap all at once.

This sounds like a good method, however I'm not sure how to handle updating reference pointers if collection happens in the midst of native function execution. — joe, Apr 09 '13 at 01:06

score 1 · Answer 2 · answered Apr 07 '13 at 23:52

You need to provide your native functions with an interface via which they can tell the garbage collector what objects they have references to, and then have them use that interface.

The easiest way is probably to not let the native code have direct pointers to interpreter/garbage collected data at all. Instead, you give the native code a handle to the object and have it call back into the runtime to get values from an object. In your example, builtin_flines would call the runtime to allocate a list and get back a handle to it. It would then read lines, and call the runtime to append each one to the list, finally returning the complete list. The runtime would manage all the handles for a given native call, freeing them up after the native call returns.

I'd like for native functions to use the same API to allocate new objects, e.g. `LuciObject *list = LuciList_new()`, as is used in the interpreter. Is it unreasonable to pass a type of stack to all native functions and require them to immediately push allocated objects onto the stack? The stack would be accessible by the interpreter/garbage collector in order to include it in the root set. — joe, Apr 09 '13 at 01:11

score 1 · Accepted Answer · answered Apr 17 '13 at 03:43

Since I'm the original "this" guy you mentioned, though I could give you some insight on your first issue based on what I've designed so far in my project (I promise I will blog about it eventually). So first, all memory allocation goes through a mutator function. The inputs arguments are the type of object you are creating, and a reference to a pointer type object that will then point to it. That pointer object is then updated at the time the new object is created. If an object is being allocated for the exclusive use of a C function in the interpreter runtime, then it is a root object. In this case, NULL is passed as the second argument, and the object is added to the list of root objects. Now later on, if that internal function no longer needs the object, it will then have to remove that object from the root-objects list. (it doesn't de-allocate the object itself, as that will be handled by the garbage collection routine eventually). Oh, and the interpreter stack itself is also an object within the interpreter (a list-type or array-type object), so a pointer to it is also in the root-objects list (again, another list-type object that is also known to the interpreter). A pointer to the root-objects list is the only pointer that the garbage collector needs to know about.

Also, as for when to start a garbage collection run -- since memory is effectively unlimited on modern architectures, I've decided to kick off the garbage collector when X number of objects has been allocated. After running you have Y objects left. If Y still greater than Z percent of X, then X gets bumped up enough to make that so. Then I just hope that malloc() never fails (if it does, I just dump an error out and exit the interpreter).

Hope this helps, and hopefully someone else will add more clarifications since I am more of an amateur when it comes to language / interpreter design.

So essentially, `Object *tmp = alloc(obj_type, &tmp)`. This allows the collector to store the address of `tmp` and change its value at any given time. I've honestly never played with a pointer like this, but it's nice. — joe, Apr 17 '13 at 17:13

score -1 · Answer 4 · answered Jan 07 '14 at 00:38

Some complications:

When you input a line to be interpreted like 100 if X then gosub 5000

But 5000 does not exist yet, you are spaghetti coding... Maybe x does not have any assigned value or data type yet. If we don't index now, are we going wait till someone types "run" or executes a line directly from the prompt?

if we do index now to speed things up later, how will we know the last instance of "100" or "X" or "5000" gets removed?

What entry do we make in the master index of "things"? Assuming these things may include lines of basic code, strings, and other variables we want to handle by name or line number.

We want to find quickly, and use to strategically identify garbage collection potential when the need for collection arises.

How much static space do we burn on the index of things that may change in size? Which details besides label, location, and length are useful enough to justify indexing? Should we attempt to index empty space when a variable shrinks? Or just index the variable's largest historical size along with its current size? How do we identify those variables that change in size most frequently, and should we avoid cleaning them, or even deliberately pad them?

When do we clean up the entire mess? or is it better to defrag only enough free space to squeeze in something that cannot be otherwise jammed sideways into an existing hole?

Purposeful delays and waiting for "input" seem good targets that we might exploit to proactively clean up some of the mess. There is no assurance any basic program will have such deadtime.

Sorry this is not an answer, but the original question seems to invite some brainstorming towards a better scheme. We need a clear strategy that requires defining the entire problem.

sorry, but SO is not a site meant for brainstorming, as you call it. This isn't a forum. This is a Q/A site. I know you don't have enough reputation to post comments anywhere, but this is still not a good example of what kind of answers you should post on SO. I won't downvote the answer because you're new to SO. — markasoftware, Jan 07 '14 at 01:05

Garbage collection issue in an interpreter implemented in C

4 Answers4