Say I have the following trivial C header file:

// foo1.h
typedef int foo;

typedef struct {
  foo a;
  char const* b;
} bar;

bar baz(foo*, bar*, ...);

My goal is to take this file, and produce an LLVM module that looks something like this:

%struct.bar = type { i32, i8* }
declare { i32, i8* } @baz(i32*, %struct.bar*, ...)

In other words, convert a C .h file with declarations into the equivalent LLVM IR, including type resolution, macro expansion, and so on.

Passing this through Clang to generate LLVM IR produces an empty module (as none of the declarations are actually used):

$ clang -cc1 -S -emit-llvm foo1.h -o - 
; ModuleID = 'foo1.h'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin13.3.0"

!llvm.ident = !{!0}

!0 = metadata !{metadata !"clang version 3.5 (trunk 200156) (llvm/trunk 200155)"}

My first instinct was to turn to Google, and I came across two related questions: one from a mailing list, and one from StackOverflow. Both suggested using the -femit-all-decls flag, so I tried that:

$ clang -cc1 -femit-all-decls -S -emit-llvm foo1.h -o -
; ModuleID = 'foo1.h'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin13.3.0"

!llvm.ident = !{!0}

!0 = metadata !{metadata !"clang version 3.5 (trunk 200156) (llvm/trunk 200155)"}

Same result.

I've also tried disabling optimizations (both with -O0 and -disable-llvm-optzns), but that made no difference to the output. The following variation, however, did produce the desired IR:

// foo2.h
typedef int foo;

typedef struct {
  foo a;
  char const* b;
} bar;

bar baz(foo*, bar*, ...);

void doThings() {
  foo a = 0;
  bar myBar;
  baz(&a, &myBar);
}

Then running:

$ clang -cc1 -S -emit-llvm foo2.h -o -
; ModuleID = 'foo2.h'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin13.3.0"

%struct.bar = type { i32, i8* }

; Function Attrs: nounwind
define void @doThings() #0 {
entry:
  %a = alloca i32, align 4
  %myBar = alloca %struct.bar, align 8
  %coerce = alloca %struct.bar, align 8
  store i32 0, i32* %a, align 4
  %call = call { i32, i8* } (i32*, %struct.bar*, ...)* @baz(i32* %a, %struct.bar* %myBar)
  %0 = bitcast %struct.bar* %coerce to { i32, i8* }*
  %1 = getelementptr { i32, i8* }* %0, i32 0, i32 0
  %2 = extractvalue { i32, i8* } %call, 0
  store i32 %2, i32* %1, align 1
  %3 = getelementptr { i32, i8* }* %0, i32 0, i32 1
  %4 = extractvalue { i32, i8* } %call, 1
  store i8* %4, i8** %3, align 1
  ret void
}

declare { i32, i8* } @baz(i32*, %struct.bar*, ...) #1

attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
attributes #1 = { "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = metadata !{metadata !"clang version 3.5 (trunk 200156) (llvm/trunk 200155)"}

Besides the placeholder doThings, this is exactly what I want the output to look like! The problem is that this requires (1) using a modified version of the header and (2) knowing the types of things in advance. Which leads me to...

Why?

Basically, I'm building an implementation of a language that uses LLVM to generate code. The implementation should support C interop given only C header files and their associated libs (no manual declarations); the compiler will use these before link time to ensure that function invocations match their signatures. Hence, I've narrowed the problem down to two possible solutions:

  1. Turn the header files into LLVM IR/bitcode, from which the type signature of each function can then be read
  2. Use libclang to parse the headers, then query the types from the resulting AST (my 'last resort' in case there is no sufficient answer for this question)

TL;DR

I need to take a C header file (such as the above foo1.h) and, without changing it, generate the aforementioned expected LLVM IR using Clang, OR find another way to get function signatures from C header files (preferably using libclang or building a C parser).

Kyle Lacy
  • If I understand correctly, it does not emit IR until you actually use the declared stuff. So instead of parsing yourself you could change Clang's behaviour to emit IR even if declarations are not used. This might be easier than dealing with ASTs yourself. Most likely you will have to flip a boolean somewhere in the code, where exactly, I dunno. – shrm Jul 14 '14 at 16:18
  • @mishr Yes, I do believe you're right that it would be easier to modify Clang's source and simply toggle a switch than to use libclang's AST. However, it's certainly less practical long-term to rely on a custom fork of such a large project (since each successive version of Clang would need to be patched). Not to mention, this question is for part of an open source project, so demanding that others who use it install a specific forked version of Clang seems unfair. – Kyle Lacy Jul 14 '14 at 20:04
  • Seems like you are not alone: http://stackoverflow.com/questions/14032496/how-can-i-code-generate-unused-declarations-with-clang – Matthias Jul 18 '14 at 10:40
  • Found this: http://clang-developers.42468.n3.nabble.com/On-preserving-unused-file-local-definitions-at-O0-td4038825.html. Gist: it is up for discussion in the clang dev community whether -O0 should throw away unused items. -femit-all-decls is poorly named: it controls emission of definitions, not declarations. I would go for submitting a patch that fixes the behavior in your favor. – Matthias Jul 18 '14 at 10:43
  • There's little to do: if modifying the source isn't an option then you can either push on the dev mailing list for this or find a workaround for your code – Marco A. Aug 01 '14 at 23:03
  • Did you ever figure this out? I too am writing my own language that's meant to interoperate with existing code, so I need some way to refer to structs/functions from existing files. I have thought of having my build system just compile everything (my language + the other language) to LLVM IR and just link it all together with `llvm-link`, but I don't know yet if that will work. I really don't want to do any additional parsing, and the answer below is just downright nasty :( – Thomas Oct 29 '16 at 09:22
  • @Thomas Unfortunately I never found a solution I was really happy with. I think your best bet would be to start with `libclang` to parse the C files (but I think you'd still need to produce the LLVM IR manually in that case...) – Kyle Lacy Oct 30 '16 at 21:48
  • @KyleLacy I think I am going to first work with a hardcoded set of dependencies so I can move forward, and then try to parse the external IR. If I can't, then libclang it is I suppose... thanks for your answer – Thomas Oct 31 '16 at 09:37

1 Answer

This is perhaps the less elegant solution, but it stays with the idea of a doThings function that forces the compiler to emit IR because the declarations are used.

The two problems you identify with this approach are that it requires modifying the header, and that it requires a deeper understanding of the types involved in order to generate "uses" to put in the function. Both of these can be overcome relatively simply:

  1. Instead of compiling the header directly, #include it (or more likely, a preprocessed version of it, or multiple headers) from a .c file that contains all the "uses" code. Straightforward enough:

    // foo.c
    #include "foo.h"
    void doThings(void) {
        ...
    }
    
  2. You don't need detailed type information to generate the "uses", and you don't need to match struct instantiations to parameters the way the doThings in your question does. You don't actually need to gather the function signatures yourself.

    All you need is the list of the names themselves and to keep track of whether they're for a function or for an object type. You can then redefine your "uses" function to look like this:

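    void * doThings(void) {
        /* One uniform function-pointer type and one union cover every kind of name. */
        typedef void * (*vfun)(void);
        typedef union v { void * o; vfun f; } v;
    
        /* Never dereference the result: the array has automatic storage, so the
           returned pointer dangles. The function exists only to reference names. */
        return (v[]) {
            (v){ .o = &(bar){0} },   /* object type: instantiate it */
            (v){ .f = (vfun)baz },   /* function: take its address, don't call it */
        };
    }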
    

    This greatly simplifies the necessary "uses" of a name to either casting it to a uniform function type (and taking its pointer rather than calling it), or wrapping it in &( and ){0} (instantiating it regardless of what it is). This means you don't need to store actual type information at all, only the kind of context from which you extracted the name in the header.

    (Obviously, give the dummy function and the placeholder types long, unique names so they don't clash with the code you actually want to keep.)

This simplifies the parsing step tremendously since you only have to recognise the context of a struct/union or function declaration, without actually needing to do very much with the surrounding information.
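
The generation step described above could be sketched roughly as follows, assuming the names have already been collected into a table. The `decl_kind`, `decl_name`, and `emit_uses` names and the `gen_uses_` prefix are hypothetical, chosen only for illustration; none of them come from Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: a name found in the header, plus the kind of
   context it was found in. No type information is stored. */
typedef enum { DECL_FUNCTION, DECL_OBJECT_TYPE } decl_kind;

typedef struct {
    decl_kind kind;
    const char *name;
} decl_name;

/* Write a .c file that includes `header` and references every listed name
   once inside a dummy function. Returns 0 on success, -1 on failure.
   An empty list is rejected because an empty initializer list is not
   valid C before C23. */
int emit_uses(FILE *out, const char *header,
              const decl_name *names, size_t count)
{
    assert(out != NULL && header != NULL);
    if (count == 0 || names == NULL)
        return -1;

    fprintf(out, "#include \"%s\"\n\n", header);
    fprintf(out, "void *gen_uses_doThings(void) {\n");
    fprintf(out, "    typedef void *(*gen_uses_vfun)(void);\n");
    fprintf(out, "    typedef union gen_uses_v "
                 "{ void *o; gen_uses_vfun f; } gen_uses_v;\n\n");
    fprintf(out, "    return (gen_uses_v[]) {\n");

    for (size_t i = 0; i < count; i++) {
        if (names[i].kind == DECL_FUNCTION) {
            /* Function: cast to the uniform function type, never call it. */
            fprintf(out, "        (gen_uses_v){ .f = (gen_uses_vfun)%s },\n",
                    names[i].name);
        } else {
            /* Object type: wrap the name in &( and ){0}. */
            fprintf(out, "        (gen_uses_v){ .o = &(%s){0} },\n",
                    names[i].name);
        }
    }

    fprintf(out, "    };\n}\n");
    return ferror(out) ? -1 : 0;
}
```

Calling `emit_uses(stdout, "foo1.h", names, 2)` with the entries `{ DECL_OBJECT_TYPE, "bar" }` and `{ DECL_FUNCTION, "baz" }` would print a file equivalent to the hand-written example, which can then be passed to Clang in place of the header.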


A simple but hackish starting point (which I would probably use because I have low standards :D ) might be:

  • grep through the headers for #include directives that take an angle-bracketed argument (i.e. an installed header you don't want to also generate declarations for).
  • use this list to create a dummy include folder with all of the necessary include files present but empty
  • preprocess it in the hope that'll simplify the syntax (clang -E -I local-dummy-includes/ -D"__attribute__(...)=" foo.h > temp/foo_pp.h or something similar)
  • grep through for struct or union followed by a name, } followed by a name, or name (, and use this ridiculously simplified non-parse to build the list of uses in the dummy function, and emit the code for the .c file.

It won't catch every possibility; but with a bit of tweaking and extension, it probably will actually deal with a large subset of realistic header code. You could replace this with a dedicated simplified parser (one built to only look at the patterns of the contexts you need) at a later stage.
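
As a rough illustration of that simplified non-parse, here is a sketch in C rather than grep. It assumes the input is already preprocessed and contains declarations only (a function body would confuse it, since any identifier after a closing brace is treated as a typedef name). The `found_kind`, `found_name`, and `scan_names` names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { FOUND_FUNCTION, FOUND_TYPEDEF_NAME, FOUND_TAG } found_kind;

typedef struct {
    found_kind kind;
    char name[64];
} found_name;

static int is_ident_start(int c) { return isalpha(c) || c == '_'; }
static int is_ident_char(int c)  { return isalnum(c) || c == '_'; }

/* Scan `src` for three patterns: `struct`/`union` followed by a name,
   `}` followed by a name, and a name followed by `(`.
   Stores up to `cap` matches in `out` and returns how many were stored. */
size_t scan_names(const char *src, found_name *out, size_t cap)
{
    assert(src != NULL && out != NULL);
    size_t n = 0;
    int after_brace = 0;   /* previous token was '}' */
    int after_tag_kw = 0;  /* previous token was 'struct' or 'union' */
    const char *p = src;

    while (*p != '\0' && n < cap) {
        unsigned char c = (unsigned char)*p;
        if (isspace(c)) { p++; continue; }
        if (!is_ident_start(c)) {
            /* Punctuation: remember only whether it was a closing brace. */
            after_brace = (c == '}');
            after_tag_kw = 0;
            p++;
            continue;
        }

        /* Consume one identifier. */
        const char *start = p;
        while (is_ident_char((unsigned char)*p)) p++;
        size_t len = (size_t)(p - start);

        /* Peek at the next non-space character. */
        const char *q = p;
        while (isspace((unsigned char)*q)) q++;

        int is_tag_kw = (len == 6 && strncmp(start, "struct", 6) == 0) ||
                        (len == 5 && strncmp(start, "union", 5) == 0);
        int matched = 1;
        found_kind kind = FOUND_FUNCTION;
        if (is_tag_kw)          matched = 0;
        else if (after_tag_kw)  kind = FOUND_TAG;
        else if (after_brace)   kind = FOUND_TYPEDEF_NAME;
        else if (*q == '(')     kind = FOUND_FUNCTION;
        else                    matched = 0;

        /* Names too long for the buffer are skipped, not truncated. */
        if (matched && len < sizeof out[n].name) {
            memcpy(out[n].name, start, len);
            out[n].name[len] = '\0';
            out[n].kind = kind;
            n++;
        }
        after_brace = 0;
        after_tag_kw = is_tag_kw;
    }
    return n;
}
```

Run over the contents of foo1.h, this reports bar as a typedef name and baz as a function. It will also report false positives such as `sizeof(` or macro invocations, which is the kind of tweaking the paragraph above anticipates.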

Alex Celeste