While working on a compiler for a toy language I designed, I looked into the options for implementing generics (by studying how existing languages do it), and I started wondering about C# generics.

I'll try to describe what I've understood first. Please feel free to correct me at any point if I've misunderstood anything. I will be using the term generic class/type/template/definition to refer to something like List<T> and concrete class/type to refer to something like List<int>, List<string>, etc.

Apologies in advance about the length of the post.


C++ uses templates for generic programming. As far as I understand, this means that:

  1. The C++ compiler, when it encounters a template definition, simply keeps its text (or an equivalent of the text) in memory.
  2. Then, every time a reference to a (new) concrete type is encountered in the code, the generic template is consulted, the text for the requested concrete type is generated (by replacing the type parameters with the requested combination), and finally that code is compiled and added to the binary.

The template itself will not be available in the binary, since it was merely a text representation and never a concrete type. Only the concrete types generated from it will be visible.

(I am unsure about this last part - as I stated before, do correct me if I'm off.)
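
As a rough illustration of the above (a minimal sketch, not a description of how any particular compiler works internally; `Box` and `use_boxes` are names invented for this example), each distinct type argument produces a separate concrete type, and only the instantiations that are actually used get compiled:

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// The generic definition. On its own it produces no code in the binary.
template <typename T>
struct Box {
    T value;
    T get() const { return value; }
};

// Two different type arguments give two unrelated concrete types.
static_assert(!std::is_same<Box<int>, Box<double>>::value,
              "Box<int> and Box<double> are distinct types");

// Hypothetical helper: it uses Box<int> and Box<std::string>, so the
// compiler instantiates and compiles exactly those two versions here.
inline void use_boxes() {
    Box<int> a{42};
    Box<std::string> b{"hello"};
    assert(a.get() == 42);
    assert(b.get() == "hello");
}
```

Calling `use_boxes()` exercises both instantiations; `Box<double>` is only named in the `static_assert`, so no member code is generated for it.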


Java uses type erasure: generic type parameters are checked for type safety only at compile time. Then (if no type mismatch is detected) they are all replaced with references to the Object base type (or to the parameter's declared bound, if it has one), effectively reusing one (non-generic) class for all concrete type references.


Now, to the actual question.

After reading an interview with Anders Hejlsberg (not specifically the linked one, but his point is the same), where he criticized Java's type erasure, I had assumed that C# does not use type erasure. And, since we can reflect on the concrete types in C# and we do have such things as LINQ (which involves rather complicated uses of the generic capabilities of C#), we can safely say that C# indeed does not employ type erasure like Java.

Not knowing of any other options, I assumed that C# uses something like templates. Not quite C++ templates (as I understand them, at least), because we can obviously create a concrete type whose generic version was described in another assembly, and because we can reflect on the type again and get information about it. So, what I thought was that C# generics were more like templates plus metadata.

At some point, I read somewhere (though I can't find the link) that C# generic classes are actually abstract classes behind the scenes. Obviously, this wasn't in agreement with what I thought I knew - weird.

I forgot about it for a few months, but today I stumbled upon a question on SO very similar to this one (which, unfortunately, has no definitive answer). The author of that question demonstrates that the C# compiler doesn't do method resolution at compile time for generics even for types that can be known at compile time (shown using the new method he created to shadow object.GetHashCode).

Okay, so C# generics definitely aren't "text replacement plus metadata", as I originally thought. If they were, then in that question the (textually generated) concrete type for Test would have led the compiler to resolve the GetHashCode call very differently.
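
For contrast, this is roughly what per-instantiation resolution looks like with C++ templates, which is the behavior I would have expected from a "text replacement" model. The sketch is only an analogy: `Base`, `Derived`, `hash_name`, `describe` and `run_demo` are hypothetical names, and C++ member hiding merely stands in for C#'s `new` modifier.

```cpp
#include <cassert>
#include <string>

struct Base {
    std::string hash_name() const { return "Base"; }
};

// Derived hides (does not override) Base::hash_name, loosely similar
// to shadowing a method with `new` in C#.
struct Derived : Base {
    std::string hash_name() const { return "Derived"; }
};

// The call below is looked up separately for every instantiation,
// using the static type T that was substituted in.
template <typename T>
std::string describe(const T& value) {
    return value.hash_name();
}

// Hypothetical demo function showing the two resolutions side by side.
inline void run_demo() {
    assert(describe(Base{}) == "Base");
    assert(describe(Derived{}) == "Derived");
}
```

Here `describe(Derived{})` picks `Derived::hash_name`, because the body is effectively recompiled with `T = Derived`.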

But instead, the C# compiler resolves it as if the type were nothing more than object, for all concrete implementations of the generic type, including the one where the new GetHashCode would have been resolved if it were non-generic code. This makes it look like C# generics are closer to type erasure plus metadata. I know the term is not very apt: it's not really type erasure if the metadata of every concrete class retains the type parameter information. But it does resemble Java's approach of storing everything as an object (at least for reference types, which are essentially interchangeable at a low level) and casting back and forth.

I've tried to imagine the third possibility (for which, as I said, I can't find any sources - so take it with a grain of salt), that generic classes are represented as abstract classes behind the scenes, which are simply extended and intelligently specialized by the compiler every time a new concrete type is to be generated, but I can't fully understand how it would work in practice. For example, the semantics of the modifier sealed for a generic class would have to be "shifted" to allow for that class (in the assembly) to be extended, but not its children. Generally, I think it would make the compiler (and possibly the runtime too) very complicated in ways I can't even begin to understand.

So, how are generics really implemented in C#? Although they definitely have differences, are they closer to those of C++ or closer to those of Java? Or is the generic definition really represented as a special abstract class in the assembly? Or is it perhaps something entirely different from what I've described?

I'm not looking for a particularly detailed answer (although it would be welcome). An explanation in simple terms would be fine, as long as it clearly highlights the differences between C#/Java/C++ and offers me at least a theoretical knowledge of how the C# compiler and runtime tackle generic classes.


Edit #1: I am aware that Eric Lippert has highlighted differences between C++ templates and C# generics in at least one blog post, essentially saying "C# generics are not templates" but I'm not aware of any explanation as to what they are behind the scenes.


Edit #2: The answer to the linked question does not address the issue I'm asking about. That answer briefly explains a specific example or phenomenon that's the result of the underlying implementation, but it absolutely does not explain what's being asked: what the underlying implementation is.

Theodoros Chatzigiannakis
  • More precisely in C++ case: it's the preprocessor that generates the code from templates, that is being then compiled. You're right that there is no trace of the template in the compilation result (well, there is indirectly - in the names of generated classes). – BartoszKP Jan 29 '14 at 19:45
  • This is a duplicate of any number of blog posts from Eric Lippert et. al. – John Saunders Jan 29 '14 at 19:47
  • @JohnSaunders Please see my edit (at the bottom). If you can link to an explanation on what C# generics really are behind the scenes, I'd be glad to read it and accept it as an answer. – Theodoros Chatzigiannakis Jan 29 '14 at 19:51
  • I think it's more of an IL issue. Maybe the answer you are looking for is in how the JIT-er compiles generic IL methods. For the C# compiler, as I see it, there is not much difference between a normal type and a generic type. Generics support is baked into the CLR; I don't think you can observe it only as a C# compiler feature. – Jurica Smircic Jan 29 '14 at 20:21
  • The underlying implementation is: the C# compiler generates IL. The JIT compiler generates machine code from the IL. When the jitter first encounters a generic method or method of generic type instantiated with a particular set of type parameters, it generates fresh machine code for that instantiation. The jitter is clever enough to re-use existing machine code for all type arguments that are reference types. That is, if `List.Add` has been jitted then a call to `List.Add` calls the code for `List.Add`. Do you see why that's legal? – Eric Lippert Jan 30 '14 at 18:34
  • But `List.Add` and `List.Add` are each jitted on their own; type arguments of value types trigger fresh code generation. This is in contrast to both C++, where all the compilation is done at compile time, and to Java, where effectively `List` is treated as `List` -- the ints are boxed. – Eric Lippert Jan 30 '14 at 18:37
  • @EricLippert Thanks! So, in essence, it's one class reused for reference types (because they are interchangeable at the runtime level, I presume), one class for each of the used value types (I imagine mainly because of their different sizes - to avoid the penalty of boxing them to references), plus some generics metadata so that other assemblies are able to emit their own types from the parameterized version? – Theodoros Chatzigiannakis Jan 30 '14 at 20:12
  • @TheodorosChatzigiannakis: That's pretty much it. The reference types are interchangable because any reference types that would be incompatible have already been ruled out at compile time by the C# compiler. – Eric Lippert Jan 30 '14 at 20:13
  • @EricLippert That makes sense. In the example of the linked question, however, why doesn't the compiler see that another call instruction should/could have been emitted in that case? I think that even at compile time it has enough information to deduce that two versions could be made for the reference types - one for `Test` and subclasses (calling the newer `GetHashCode`) and one for the rest of the reference types (calling the original from `object`)? Is there a deeper language or runtime design reason behind this? – Theodoros Chatzigiannakis Jan 30 '14 at 20:18

1 Answer

See here for how they work internally: http://microsoftdev.blogspot.in/2007/07/how-generics-work.html

The difference between C++ templates and C# generics is described here:

http://msdn.microsoft.com/en-us/library/c6cyy67b.aspx

Aniket Inge