Do references get updated when Garbage Collectors move data in heap?

Question

I read that GC (Garbage Collectors) moves data in Heap for performance reasons, which I don't quite understand why since it is random access memory, maybe for better sequential access but I wonder if references in Stack get updated when such a move occurs in Heap. But maybe the offset address remains the same but other parts of data get moved by Garbage Collectors, I am not sure though.

I think this question pertains to implementation detail since not all garbage collectors may perform such optimization or they may do it but not update references (if it is a common practice among garbage collector implementations). But I would like to get some overall answer specific to CLR (Common Language Runtime) garbage collectors though.

And also I was reading Eric Lippert's "References are not addresses" article here, and the following paragraph confused me little bit:

If you think of a reference is actually being an opaque GC handle then it becomes clear that to find the address associated with the handle you have to somehow "fix" the object. You have to tell the GC "until further notice, the object with this handle must not be moved in memory, because someone might have an interior pointer to it". (There are various ways to do that which are beyond the scope of this screed.)

It sounds like for reference types, we don't want data to be moved. Then what else we store in the heap, which we can move around for performance optimization? Maybe type information we store there? By the way, in case you wonder what that article is about, then Eric Lippert is comparing references to pointers little bit and try to explain how it may be wrong to say that references are just addresses even though it is how C# implements it.

And also, if any of my assumptions above is wrong, please correct me.

If I recall correctly, yes. There's a "relocating" phase in the GC that moves all the objects to remove/reduce memory fragmentation and at this phase, the references to the moved objects get updated. I'll try to find the link from Channel9 (or maybe from an MSDN article) and will update this comment. Edit: Here's the link: http://msdn.microsoft.com/en-us/library/ee787088(v=vs.110).aspx#what_happens_during_a_garbage_collection (have a look at the relocating phase). — kha, Dec 23 '14 at 09:01
@AdamHouldsworth: But my question is to learn how that happens: Does it maintain the reference values by updating them when the whole object is moved to some other memory addresses, or by simply not moving the initial address of the object array so that it doesn't need to change reference value. — Tarik, Dec 23 '14 at 09:01
@Tarik updated with a link. Let me know if it's not sufficient and I'll dig further. IIRC , it's explained very well and in details in CLR Via C# (an excellent book I may add) — kha, Dec 23 '14 at 09:03
@Tarik Indeed, which is why I haven't posted an answer. I'm waiting for an answer too. — Adam Houldsworth, Dec 23 '14 at 09:04
Another link with a lot more detail (I don't think you can get more details than this): http://www.informit.com/articles/article.aspx?p=1409801&seqNum=2 This is the bit you may find interesting: `When the garbage collection occurs, the memory occupied by objects B and D is reclaimed,which leads to gaps on the managed heap. To remove these gaps, the garbage collector compacts the remaining live objects (Obj A, C, and E) and coalesces the two free blocks (used to hold Obj B and D) into one free block. Lastly, the current allocation pointer is updated as a result of the compacting and coalescing` — kha, Dec 23 '14 at 09:14
You misinterpreted the quoted paragraph. What it's saying, in a round-about way, is that there must be some mechanism to assure that the address inside a reference does not change *while the reference is actively being used to manipulate the pointed-to object*. There are various ways to provide this assurance. — Hot Licks, Dec 23 '14 at 14:02

score 18 · Accepted Answer · edited May 23 '17 at 12:32

Yes, references get updated during a garbage collection. Necessarily so, objects are moved when the heap is compacted. Compacting serves two major purposes:

it makes programs more efficient by using the processor's data caches more efficiently. That is a very, very big deal on modern processors, RAM is exceedingly slow compared to the execution engine, a fat two orders of magnitude. The processor can be stalled for hundreds of instructions when it has to wait for RAM to supply a variable value.
it solves the fragmentation problem that heaps suffer from. Fragmentation occurs when a small object is released that is surrounded by live objects. A hole that cannot be used for anything else but an object of equal or smaller size. Bad for memory usage efficiency and processor efficiency. Note how the LOH, the Large Object Heap in .NET, does not get compacted and therefore suffers from this fragmentation problem. Many questions about that at SO.

In spite of Eric's didactic, an object reference really is just an address. A pointer, exactly the same kind you'd use in a C or C++ program. Very efficient, necessarily so. And all the GC has to do after moving an object is update the address stored in that pointer to the moved object. The CLR also permits allocating handles to objects, extra references. Exposed as the GCHandle type in .NET, but only necessary if the GC needs help determining if an object should stay alive or should not be moved. Only relevant if you interop with unmanaged code.

What is not so simple is finding that pointer back. The CLR is heavily invested in ensuring that can be done reliably and efficiently. Such pointers can be stored in many different places. The easier ones to find back are object references stored in a field of an object, a static variable or a GCHandle. The hard ones are pointers stored on the processor stack or a CPU register. Happens for method arguments and local variables for example.

One guarantee that the CLR needs to provide to make that happen is that the GC can always reliably walk the stack of a thread. So it can find local variables back that are stored in a stack frame. Then it needs to know where to look in such a stack frame, that's the job of the JIT compiler. When it compiles a method, it doesn't just generate the machine code for the method, it also builds a table that describes where those pointers are stored. You'll find more details about that in this post.

It's entertaining to imagine a world in which references are addresses indirected through handles that are indexes into a table of addresses. Sure, every access gets slightly slower, but we pay the exact same penalty for virtual function calls without stressing out about it. When implementing such a scheme you soon realize that when an object is deallocated you get a hole in the address table, and now you're back to the same problem as before: how to get rid of the holes to keep the table small. There is no free lunch here! I like to ask a variation on this as an interview problem. — Eric Lippert, Dec 29 '14 at 15:51
@EricLippert, The original Apple operating system for 68K models (Lisa/MacIntosh) worked like that, i.e. double indirection for memory access. — adrianm, Dec 30 '14 at 09:08
@adrianm: Even in C++ it's not hard to imagine situations where a scheme could be advantageous. I've been toying with writing an embedded-ARM string library using that principle, targeting machines with under 128K of RAM (in many cases under 16K). Having each C++ object hold a 16-bit handle [with 128K of RAM, it's a safe bet the number of strings will be small enough to fit] which then references a 16-bit string-pool scaled offset, and requiring that only one C++ object identify any handle (which must be constructed/destructed) but allowing multiple handles to identify a string, should... — supercat, Mar 23 '15 at 15:45
...work pretty well. I envision a cost of four bytes per reference (two in the pool, two outside); strings instances up to 64 bytes would cost one byte plus the string content. Longer strings would involve time/space trade-offs (I'd probably have a "flat array" type which would behave as one string but be stored as an array of strings to be concatenated, with all but the first and last being grouped into chunks of 32-63 characters). Thus concatenation and substring-extraction would often require creating new short string instances for the start/end of the new string, or the... — supercat, Mar 23 '15 at 15:49
...joint between two strings, but large portions of existing strings could be reused. If handles cost four bytes each, trying to support sharing with reference counts wouldn't make any sense. — supercat, Mar 23 '15 at 15:50

Yuval Itzchakov · Answer 2 · 2014-12-23T11:20:58.807

Looking at C++\CLI In Action, there's a section about interior pointers vs pinning pointers:

C++/CLI provides two kinds of pointers that work around this problem. The first kind is called an interior pointer, which is updated by the runtime to reflect the new location of the object that's pointed to every time the object is relocated. The physical address pointed to by the interior pointer never remains the same, but it always points to the same object. The other kind is called a pinning pointer, which prevents the GC from relocating the object; in other words, it pins the object to a specific physical location in the CLR heap. With some restrictions, conversions are possible between interior, pinning, and native pointers.

From that, you can conclude that reference types do move in the heap and their addresses do change. After the Mark and Sweep phase, the objects get compacted inside the heap, thus actually moving to new addresses. The CLR is responsible to keep track of the actual storage location and update those interior pointers using an internal table, making sure that when accessed, it still points to the valid location of the object.

There's an example taken from here:

ref struct CData
{
    int age;
};

int main()
{
    for(int i=0; i<100000; i++) // ((1))
        gcnew CData();

    CData^ d = gcnew CData();
    d->age = 100;

    interior_ptr<int> pint = &d->age; // ((2))

    printf("%p %d\r\n",pint,*pint);

    for(int i=0; i<100000; i++) // ((3))
        gcnew CData();

    printf("%p %d\r\n",pint,*pint); // ((4))
    return 0;
}

Which is explained:

In the sample code, you create 100,000 orphan CData objects ((1)) so that you can fill up a good portion of the CLR heap. You then create a CData object that's stored in a variable and ((2)) an interior pointer to the int member age of this CData object. You then print out the pointer address as well as the int value that is pointed to. Now, ((3)) you create another 100,000 orphan CData objects; somewhere along the line, a garbage-collection cycle occurs (the orphan objects created earlier ((1)) get collected because they aren't referenced anywhere). Note that you don't use a GC::Collect call because that's not guaranteed to force a garbage-collection cycle. As you've already seen in the discussion of the garbage-collection algorithm in the previous chapter, the GC frees up space by removing the orphan objects so that it can do further allocations. At the end of the code (by which time a garbage collection has occurred), you again ((4)) print out the pointer address and the value of age. This is the output I got on my machine (note that the addresses will vary from machine to machine, so your output values won't be the same):

012CB4C8 100
012A13D0 100

Is this how we can have an interior?: `var person = new Person(); var name = person.Name;`, so does `name` become an interior pointer type of reference in C#? — Tarik, Dec 23 '14 at 16:30
In C#, the concept of "interior pointer" doesn't really exist, only inside unsafe code. You can look it at logically is an "interior pointer" which will always reference the correct object address throughout allocation. When doing unsafe code, you retrieve an interior pointer, not a native pointer. — Yuval Itzchakov, Dec 23 '14 at 16:47

Do references get updated when Garbage Collectors move data in heap?

2 Answers2

Linked