Separate mark bits work by having an array of bits where each bit represents an address in the heap that can start an object. For example, suppose the heap is 65536 bytes and all objects are aligned at 16 byte boundaries, then there are 4096 addresses in the heap that can be the start of an object. This means the array needs to contain 4096 bits, which can be efficiently stored as 512 bytes or 64 64bit sized unsigned integers.
In-object mark bits works by having one bit of each header of each object be set to 1 if the object is marked and 0 otherwise. Note that this requires each object to have a dedicated header area. Runtimes such as the JVM and .NET all add headers to objects so you essentially get the space for the mark bit for free.
But it doesn't work for conservative collectors which don't have full control of the environment they are running in, such as the Boehm GC. They can misidentify integers as pointers, so for them modifying anything in the mutators data heap is risky.
Mark & sweep garbage collection is divided into two phases: marking and sweeping. Marking using in-object mark bits is straight-forward (pseudo-code):
if not obj.is_marked():
obj.mark()
mark_stack.append(obj)
Using a separate array for storing mark bits, we have to convert the objects address and size to indices in the bit array and set the corresponding bits to 1:
obj_bits = obj.size_in_bytes() / 16
bit_idx = (obj - heap.start_address()) / 16
if not bitarr.bit_set(bit_idx):
bitarr.set_range(bit_idx, obj_bits)
mark_stack.append(obj)
So in our example, if an object is 128 bytes long, 8 bits will be set in the bit array. Clearly, using in-object mark bits is much simpler.
But separate mark bits gain some momentum when sweeping. Sweeping involves scanning through the whole heap and finding continuous regions of memory which is unmarked and therefore can be reclaimed. Using in-object mark bits, it would roughly look like this:
iter = heap.start_address()
while iter < heap.end_address():
# Scan til the next unmarked object
while iter.is_marked():
iter.unmark()
iter += iter.size()
if iter == heap.end_address():
return
# At an unmarked block
start = iter
# Scan til the next marked object
while iter < heap.end_address() and not iter.is_marked():
iter += iter.size()
size = iter - start
# Reclaim the block
heap.reclaim(start, size)
Note how the iteration jumps from object to object in the iter += iter.size()
lines. This means that the sweep phase running time is proportional to the total number of live and garbage objects.
Using separate mark bits, you would do roughly the same loop except that large swathes of garbage objects would be flown over without "stopping" on each of them.
Consider the 65536 heap again. Suppose it contains 4096 objects that are all garbage. Iterating the 64 64bit integers in the mark bits array and seeing that they are all 0 is obviously very fast. Therefore the sweeping phase can potentially be much faster with separate mark bits.
But there is another wrinkle! In any mark and sweep collector, the running time is dominated by the mark phase and not the sweep phase which is usually very quick. So the verdict is still out. Some prefer separate mark bits, other prefer in-object ones. To the best of my knowledge, no one has yet been able to show which approach is superior to the other.