I want to learn the theory behind garbage collection. How do i go about it?
I am also a dabbler interested in garbage collection (to the point that I wrote my own garbage collected VM called HLVM). I learned by reading as many research papers on garbage collection as I could get my hands on and by playing with the ideas myself, both raw in my virtual machine and also by writing memory-safe high-level simulations.
The obvious answer is - a compiler textbook... The question is, is it necessary to learn lexical analysis, parsing and other stuff that usually precedes garbage collection in a text?
The lexical analysis, parsing and other stuff is not relevant to garbage collection. You might get an outdated cursory overview of garbage collection from a compiler book but you need to read research papers to get an up-to-date view, e.g. with regard to multicore.
In short, what are the prerequisites to learning about Garbage collection theory?
You need to know about basic graph theory, pointers, stacks, threads and (if you're interested in multi-threading) low-level concurrency primitives such as locks.
Garbage collection is all about determining reachability. When a program can no longer obtain a reference to a value because that value has become unreachable then the GC can recycle the memory that value is occupying. Reachability is determined by traversing the heap starting from a set of "global roots" (global variables and pointers on the thread's stacks and in the core's registers)
GC design has many facets but you might begin with the four main garbage collection algorithms:
- Mark-and-sweep (McCarthy, 1960)
- Mark-and-compact (Haddon and Waite, 1967)
- Stop-and-copy (Cheney, 1970)
- Mark-region (McKinley et al., 2007)
Perhaps the most notable evolution of these basic ideas is generational garbage collection, which was the defacto standard design for many years.
My personal feeling is that some of the obscure work on garbage collection conveys just as much useful information so I'd also highly recommend:
You might also like to study the three kinds of write barrier (Dijkstra's, Steele's and Yuasa's) and look at the card marking and remembered set techniques.
Then you might also like to examine the actual design decisions some implementors chose for language implementations like Java and .NET as well as the SML/NJ compiler for Standard ML, the OCaml compiler, the Glasgow Haskell compiler and others. The differences between the techniques adopted are as big as the similarities between them!
There are also some great tangentially-related papers like Henderson's Accurate Garbage Collection in an Uncooperative Environment. I used that technique to avoid having to write a stack walker for HLVM.
The memorymanagement.org website is an invaluable resource, particularly the glossary of GC-related terms.