
One trend in compiler construction is to use typed intermediate languages. Haskell's GHC, whose Core intermediate language is a variant of System F-omega, is an example of this architecture [1]. Another is LLVM, which has a typed intermediate language at its core [2]. The benefit of this approach is that errors in the transformations that make up the code generator can be detected early. In addition, the type information can be used during optimization and code generation.

For efficiency, typed IRs are type-checked rather than having their types inferred. To make those checks fast, each variable and each binder carries its type, so checking is purely syntax-directed.
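For illustration, here is a minimal sketch in Haskell of such an IR (a toy language of my own, not GHC Core or LLVM IR): the type annotations on the `Lam` and `Let` binders are exactly what makes checking cheap.

```haskell
-- Toy explicitly typed IR: binders carry their types, so a checker
-- never has to infer or unify anything.
data Ty = TInt | TFun Ty Ty
  deriving (Eq, Show)

data Expr
  = Var String              -- variable occurrence
  | Lam String Ty Expr      -- \(x : t) -> body
  | App Expr Expr           -- function application
  | Let String Ty Expr Expr -- let (x : t) = e in body
  deriving Show
```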

However, many transformations in the compiler pipeline may introduce new variables. For example, a normalization transformation K(.) might transform an application

M(N)

into an expression like

let x = K(M) in
let y = K(N) in x(y)
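For concreteness, here is a sketch of such a normalization pass over the toy IR above (the `Fresh` name supply is my own assumption, not taken from any real compiler). Building the two `let`s forces the pass to produce types for x and y; one natural implementation threads types through the pass, which already hints at one answer to the question below.

```haskell
import Control.Monad.State (State, get, put)

-- A simple supply of fresh variable names.
type Fresh a = State Int a

freshName :: Fresh String
freshName = do
  n <- get
  put (n + 1)
  pure ("v" ++ show n)

-- Normalize an application M(N) into let x = M' in let y = N' in x(y).
-- The pass threads an environment and returns the type of its result,
-- so the fresh binders x and y can be annotated on the spot.
norm :: [(String, Ty)] -> Expr -> Fresh (Expr, Ty)
norm env (Var x) =
  case lookup x env of
    Just t  -> pure (Var x, t)
    Nothing -> error ("unbound variable: " ++ x)
norm env (Lam x t body) = do
  (body', tbody) <- norm ((x, t) : env) body
  pure (Lam x t body', TFun t tbody)
norm env (App m n) = do
  (m', tm) <- norm env m
  (n', tn) <- norm env n
  case tm of
    TFun _ tres -> do
      x <- freshName
      y <- freshName
      -- tm and tn were computed on the way up: no re-typechecking.
      pure (Let x tm m' (Let y tn n' (App (Var x) (Var y))), tres)
    _ -> error "application of a non-function"
norm env (Let x t e body) = do
  (e', _)        <- norm env e
  (body', tbody) <- norm ((x, t) : env) body
  pure (Let x t e' body', tbody)
```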

Question. I wonder how compilers handle the issue of giving types to newly introduced variables. Do they re-typecheck K(M) and K(N) in the example above? Isn't that time-consuming? And does it require passing an environment around? Do they use maps from AST nodes to type information to avoid re-running type checking?


  1. S. Marlow and S. Peyton Jones, The Glasgow Haskell Compiler, in: The Architecture of Open Source Applications, Vol. II, 2012.

  2. LLVM Language Reference Manual, https://llvm.org/docs/LangRef.html

asked by Martin Berger, edited by Torbjörn
    [A-normal form](https://en.wikipedia.org/wiki/A-normal_form) might be relevant – Basile Starynkevitch Dec 17 '15 at 14:38
  • @BasileStarynkevitch Thanks. A-normal form is important, but is there a theory of how to efficiently type-check / type-infer A-normal forms? – Martin Berger Dec 17 '15 at 14:59
  • My approach is to re-run type propagation after transforms instead of faithfully carrying all the type information. Some transforms may strip the types altogether (as long as they're recoverable). Type propagation on something as simple as an IR is far from being a performance bottleneck. – SK-logic Dec 18 '15 at 13:51
  • @SK-logic Thanks, that's very useful advice. Type reconstruction should be fast, and it leads to clean compiler code. How do you suggest dealing with the environment problem? Do you carry it around in each transform, or is the type information carried by the nodes representing variables enough? – Martin Berger Dec 18 '15 at 14:05
  • @MartinBerger, I prefer to keep everything in a readable (and serialisable) IR. In an IR similar to LLVM, for example, I'm keeping sticky types with the `alloca` nodes, function arguments and global variable declarations. The rest can always be reconstructed (see the propagation sketch after these comments). For convenience I also stick types to the GEP nodes in between transforms, but if there is a transform that can introduce new GEPs I re-run the type propagation. – SK-logic Dec 18 '15 at 23:02
  • @SK-logic Thanks. I'll try that. What about adding unique AST-node IDs and maintaining maps from AST nodes to type (and other) information? That also appears to be a clean division of labour. – Martin Berger Dec 19 '15 at 16:03
  • @MartinBerger, I'm doing this for typing the higher-level IRs (e.g., enumerating expression nodes with unique variable names and then solving type equations against these free variables). But for the highly volatile low-level IRs it is a bit messy: it's easy to introduce new nodes or rewrite existing ones without updating the mapping. Keeping everything in a single, readable and serialisable AST helps a lot. – SK-logic Dec 21 '15 at 11:21
  • @SK-logic If you convert your comments into an answer, I can award the bounty to you. – Martin Berger Dec 27 '15 at 09:16
  • This question seems too abstract for SO. Compiler design might be on topic on Software Engineering SE, or another site in the network, but it doesn't seem sufficiently answerable as a Stack Overflow question. – Todd A. Jacobs Jul 19 '22 at 17:50
  • @ToddA.Jacobs I think it can be answered by explaining how a concrete compiler such as clang, ocamlc or scalac does this. – Martin Berger Jul 20 '22 at 11:14
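To illustrate the type-propagation approach from the comments, here is a minimal self-contained sketch (a hypothetical three-instruction IR of my own, not real LLVM): only `alloca` results carry sticky types, and one forward pass recovers everything else.

```haskell
import qualified Data.Map as Map

-- Hypothetical SSA-style mini-IR. Only Alloca carries a sticky type;
-- a fuller version would also annotate function arguments and globals.
data IrTy = I32 | Ptr IrTy
  deriving (Eq, Show)

data Instr
  = Alloca String IrTy          -- %x = alloca t    (sticky type)
  | Load   String String        -- %x = load %p     (type recovered)
  | Add    String String String -- %x = add %a, %b  (type recovered)

-- One forward pass rebuilds the type of every value from the roots.
propagate :: [Instr] -> Map.Map String IrTy
propagate = foldl step Map.empty
  where
    step env (Alloca x t) = Map.insert x (Ptr t) env
    step env (Load x p) =
      case Map.lookup p env of
        Just (Ptr t) -> Map.insert x t env
        _            -> error ("load from non-pointer: " ++ p)
    step env (Add x a _) =
      case Map.lookup a env of
        Just t  -> Map.insert x t env
        Nothing -> error ("unknown operand: " ++ a)
```

For example, `propagate [Alloca "p" I32, Load "v" "p", Add "s" "v" "v"]` recovers v : I32 and s : I32 without any annotations stored on the load or the add.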

1 Answer


Do they re-typecheck K(M) and K(N) in the example above?

Yes, they do, but it is cheaper than it sounds, because checking an explicitly typed term is syntax-directed. The typechecker sees that K(M) is an application of K to M: it looks up the type of K, which must be a function type, checks that the type of M matches that function type's input type, and concludes that K(M) has its output type. The same local rule gives the type of K(N). Finally, x(y) is an application of x, whose binder records the type of K(M); once y's type matches x's input type, the whole expression gets x's output type, which is K's output type.
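In code, the pass described above is a single syntax-directed function (a sketch over the toy IR from the question, not any particular compiler's checker):

```haskell
-- Syntax-directed checking: one local rule per construct, each rule
-- consulting only the binder annotations and immediate subterms.
check :: [(String, Ty)] -> Expr -> Ty
check env (Var x) =
  case lookup x env of
    Just t  -> t
    Nothing -> error ("unbound variable: " ++ x)
check env (Lam x t body) = TFun t (check ((x, t) : env) body)
check env (App f a) =
  case check env f of
    TFun targ tres
      | check env a == targ -> tres
    _ -> error "ill-typed application"
check env (Let x t e body)
  | check env e == t = check ((x, t) : env) body
  | otherwise        = error "let annotation does not match"
```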

Isn't that time-consuming?

Not really. Because every binder is annotated, checking is a single bottom-up pass, essentially linear in the size of the term: the typechecker checks subexpressions as it encounters them, and it can cache the types it has already computed so that no subexpression is checked twice.

And does it require passing an environment around? Do they use maps from AST nodes to type information to avoid re-running type checking?

Yes: the checker threads an environment mapping variables to their types, extending it at every binder. Compilers can also keep a map from AST nodes to type information; strictly speaking it is not used to avoid re-running the typechecker, but to avoid checking the same node twice within a run, which in practice amounts to much the same thing.
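A sketch of that bookkeeping (my own construction, reusing the toy `Ty` from the question): each node carries a unique id, assigned beforehand, and a memo table records each node's type so it is computed at most once per run. This is only sound if a given id is always checked under the same environment, as in a term DAG created by sharing.

```haskell
import qualified Data.Map as Map
import Control.Monad.State (State, evalState, gets, modify)

type NodeId = Int

-- The toy Expr from the question, extended with a unique id per node.
data ExprI
  = VarI NodeId String
  | LamI NodeId String Ty ExprI
  | AppI NodeId ExprI ExprI

idOf :: ExprI -> NodeId
idOf (VarI i _)     = i
idOf (LamI i _ _ _) = i
idOf (AppI i _ _)   = i

-- Memo table from node ids to already-computed types.
type Memo = Map.Map NodeId Ty

checkMemo :: [(String, Ty)] -> ExprI -> State Memo Ty
checkMemo env e = do
  cached <- gets (Map.lookup (idOf e))
  case cached of
    Just t  -> pure t                  -- already checked: reuse
    Nothing -> do
      t <- case e of
        VarI _ x ->
          pure (maybe (error ("unbound: " ++ x)) id (lookup x env))
        LamI _ x tx body ->
          TFun tx <$> checkMemo ((x, tx) : env) body
        AppI _ f a -> do
          tf <- checkMemo env f
          ta <- checkMemo env a
          case tf of
            TFun targ tres | ta == targ -> pure tres
            _ -> error "ill-typed application"
      modify (Map.insert (idOf e) t)   -- record the result
      pure t

-- Run with: evalState (checkMemo [] term) Map.empty
```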
answered by Spacewink