21

I checked Boehm GC. The GC for C/C++.

I know mark-and-sweep algorithm. What I'm in curious is how it picks up only pointers in whole C memory. My understanding about C memory is just a plain byte array. Is it possible to determine a value in memory is pointer or not?

finnw
  • 47,861
  • 24
  • 143
  • 221
eonil
  • 83,476
  • 81
  • 317
  • 516

1 Answers1

24

The Boehm GC is a conservative collector, which means it assumes everything (on pointer boundaries) is a pointer. This means that it can find false positive references, like an integer which coincidentally has the value of an address in the heap. As a result, some blocks may stay in memory longer than they would with a non-conservative collector.

Here's a description from Boehm's page:

The garbage collector uses a modified mark-sweep algorithm. Conceptually it operates roughly in four phases, which are performed occasionally as part of a memory allocation:

  1. Preparation Each object has an associated mark bit. Clear all mark bits, indicating that all objects are potentially unreachable.
  2. Mark phase Marks all objects that can be reachable via chains of pointers from variables. Often the collector has no real information about the location of pointer variables in the heap, so it views all static data areas, stacks and registers as potentially containing pointers. Any bit patterns that represent addresses inside heap objects managed by the collector are viewed as pointers. Unless the client program has made heap object layout information available to the collector, any heap objects found to be reachable from variables are again scanned similarly.
  3. Sweep phase Scans the heap for inaccessible, and hence unmarked, objects, and returns them to an appropriate free list for reuse. This is not really a separate phase; even in non incremental mode this is operation is usually performed on demand during an allocation that discovers an empty free list. Thus the sweep phase is very unlikely to touch a page that would not have been touched shortly thereafter anyway.
  4. Finalization phase Unreachable objects which had been registered for finalization are enqueued for finalization outside the collector.

You should also know that the Boehm GC needs to be given a set of "roots", which are starting points for the mark-and-sweep algorithm. The stack and registers are automatically roots. You need to explicitly add global pointers as roots.


EDIT: In comments, some concerns were pointed out about conservative collectors in general. It is true that integers that look like heap pointers to the collector can cause memory not to be released. This is not as big of a problem as you might think. Most scalar integers in a program are used for counts and sizes and are fairly small (so they would not look like heap pointers). You would mostly run into problems with arrays containing bitmaps, strings, floating point data, or anything of that sort. Boehm GC lets you allocate a block with GC_MALLOC_ATOMIC which indicates to the collector that the block will not contain any pointers. If you look in gc_typed.h, you will also find ways to specify what parts of a block may contain pointers.

That said, a fundamental limitation of a conservative collector is that it cannot safely move memory around during collection since pointer rewriting is not safe. This means you won't get any of the benefits of compaction like lowered fragmentation and improved cache performance.

rogerdpack
  • 62,887
  • 36
  • 269
  • 388
Jay Conrod
  • 28,943
  • 19
  • 98
  • 110
  • 7
    It's probably as good as you can get without some assistance from the compiler (telling you where the pointers are). – Jay Conrod Jan 25 '11 at 17:08
  • 11
    @codymanix: Your comment is the current top entry for my daily "I wish I could down-vote comments" contest. And from its "quality" you have a great chance of becoming the day's winner, if not the week's. – sbi Jan 25 '11 at 17:26
  • 2
    @sbi: I was just saying that a GC which doesn't work (leaks memory) is worth nothing. If have have a GC you must be able to rely on it. If not, it makes things worse because programmers start to omit calls to free() because "it worked when I tested it". – codymanix Jan 26 '11 at 12:13
  • 7
    @codymanix: by that definition, all GC algorithms are garbage, since all of them will leak memory in some situations. In general though, the amount of leaked garbage is limited (doesn't grow over time), and careful programming can avoid/reduce that, so it works out ok. – Chris Dodd Jan 26 '11 at 18:24
  • The link included in the answer is now out-of-date; the proper link should be [gc_types.h](http://hboehm.info/gc/gc_source/gc_typedh.txt). – Jeremy Rodi May 14 '16 at 18:16
  • @JayConrod sorry for digging up age old stuff :V – Jeremy Rodi May 14 '16 at 20:37
  • @codymanix, would it collect itself? or is it too conservative? – thang Aug 02 '16 at 22:39
  • 1
    @ChrisDodd: A collector like the one in Java or .NET won't leak memory to which no strong rooted references exist. If user code attaches event handlers to static or long-lived objects and fails to detach them, that may leave useless rooted strong references, but that's a fault in the code that failed to detach the event--not the GC. – supercat Mar 14 '19 at 21:35
  • 1
    @supercat: If the user is required to manually manage all references and ensure that no unneeded ones remain, why use garbage collection at all? Just let the user deal with it by explicitly freeing things that are no longer needed. I've lost count of the number of java programs I've seen with this kind of problem. – Chris Dodd Mar 15 '19 at 00:09
  • @supercat: More to the point though, keeping around a small bounded amout of garbage is not a problem. Leaks are only a problem if they can grow unbounded over time and are not eventually cleaned up. Using ANY garbage collector generally has a significant (fixed) space overhead compared to manual management. – Chris Dodd Mar 15 '19 at 00:15
  • 2
    @ChrisDodd: RAII is better than tracing GC for objects with owners. Tracing GC is better than RAII or reference counting for immutable objects without owners. In multi-threaded code when not using a tracing GC, determining when the last reference to an immutable string gets destroyed requires synchronization among all the threads that might be using it. In a tracing GC, sharing a string reference among threads requires nothing more than physically copying the reference and having the code include metadata to indicate what CPU registers or stack slots hold references when. – supercat Mar 15 '19 at 14:11
  • 1
    @ChrisDodd: Tracing garbage collectors can support object relocation and defragmentation, which may yield better storage efficiency than allocation methods where objects can't move. They also allow significant trade-offs between storage efficiency and time-efficiency. A lightweight tracing-GC string library for e.g. an embedded ARM with a pool size of 128KiB could support a `string` object that's two bytes, with each string reference using two bytes in the pool, and each *different* string using length+2 bytes, rounded up to the next halfword. Pretty cheap I'd say. – supercat Mar 15 '19 at 14:25
  • 2
    @supercat: You seem to be conflating copying collectors with tracing collectors. Mark-sweep tracing collectors have all the same advantages, with the exception of compacting. The memory saving from compacting is a small constant factor, often less than 10%. The overhead from a 2-space copying collector is 100% You can reduce that by using mark-compact instead of simple copying, but that has its own costs. Bottom line, the "extra overhead" from a conservative collector (like Boehm) is almost lost in the noise. – Chris Dodd Mar 15 '19 at 23:44