
Suppose I have a C source file (or a language close enough to C).

  • It may have #include directives.
  • It may have other preprocessor directives.
  • It may have zero, one or multiple function declarations with a definition.
  • It may also have declarations without definitions.
  • The syntax is valid and the file would compile if its includes were resolved (but you can't really rely on that without running the preprocessor and resolving the includes, which you won't).

Now, what is an efficient, hopefully established, way of programmatically locating function signatures in this file, preferably with some kind of parse tree or syntax tree for each signature, without recourse to any other files?

Obviously this cannot be 100% foolproof: one could use weird preprocessor tricks to "hide" or obfuscate declarations; one could use crazy spacing and indentation, etc. But I only need to catch "plain vanilla" function definitions with no funny business.

Notes:

  • The program which needs to locate the function signatures can be C++ or C. This question would be "cleaner" if I said I want to do it in C, but in reality I'm writing C++, so...
  • I'm trying to avoid something as heavy as cppast which would also probably need to be able to find those include files.
  • I think I might have had a related question at some point in time, but I can't find it...
  • You may assume the function does not have any function pointer parameters, nor does it return a function pointer.
  • The richer the AST the better, but I won't be too picky.
  • I was thinking maybe something like the parsers which IDEs use, which have to be somewhat robust against syntax errors and missing files, and have to produce something useful even for broken files. But that's just a flight of fancy.
einpoklum
  • For simple function declarations (i.e. that don't use complex types like function pointers etc.) then regular expressions might work. – Some programmer dude Jul 18 '22 at 09:58
  • Probably the first step is to skip everything inside {}, then split by semicolons, and then you have a much simpler language that is easier to parse. What you are doing is parsing by definition, so there is no "without parsing". I think your main concern here is making sure the parser is able to keep parsing the rest of the file after seeing some crazy macros that don't make sense to it. Probably after crazy macros, there will still be a ; after each function definition – user253751 Jul 18 '22 at 11:04
  • Why not use libclang? – the busybee Jul 18 '22 at 11:17
  • Do you mean something like [ctags](https://github.com/universal-ctags/ctags)? Or perhaps [cscope](http://cscope.sourceforge.net/)? – n. m. could be an AI Jul 18 '22 at 12:47
  • @n.1.8e9-where's-my-sharem.: See clarification. I want to do this within my program. But - those are interesting ideas. Maybe I can copy some of their code (or at worst - using `system()`... :-( ) – einpoklum Jul 18 '22 at 13:18
  • @Someprogrammerdude : That won't give me a parse tree as a result, I'm afraid. – einpoklum Jul 18 '22 at 13:21
  • @thebusybee: I don't know; can it do what I asked? I'm no clang/llvm expert. AFAIK, it can compile things for me, but my input can't compile since I can't perform the inclusions. Also, how easy is it to "sic" libclang on a C file and get a parse tree? – einpoklum Jul 18 '22 at 13:23
  • Well you could use the regular expressions to filter out the declarations you need, and then do a simple recursive-descent parsing (or use a Yacc/Bison/etc. parser) for the declarations to construct an AST? Could probably be done pretty lightweight. – Some programmer dude Jul 18 '22 at 13:27
  • Also, this feels very much like an XY problem... Why do you need to get the function signatures? Why do you need the AST? – Some programmer dude Jul 18 '22 at 13:28
  • @Someprogrammerdude: It is an XY situation, and this is a Y problem, because the X problem is much too wide and unspecific to be a single SO question. I have a tool which tests implementations of functions in a C-like language, whose source code and arguments are provided at run-time. Right now, I have to know the signature a priori. I'm trying to figure out whether it is realistic to try to do this entirely at run-time. If I had a parse tree / AST for the function, I think I could live with doing the rest myself. – einpoklum Jul 18 '22 at 13:39
  • @Someprogrammerdude: Hmm, that sounds like an answer actually. It's a fair bit of work, parsing a sub-grammar of C, and figuring out just the right regex, but it should be doable. – einpoklum Jul 18 '22 at 13:40
  • Are you only interested in function prototypes? What should the AST look like? Are there any function pointers as arguments? as return values? – chqrlie Jul 18 '22 at 14:46
  • @chqrlie: See edit. I'm interested in the function prototypes, since that's what I need to take the function's arguments from the command line. Or, ok, that's a bit of a fib, because if the function takes strange types I don't know about, I'll fail, but I can at least give an excuse for failing. I'm _indirectly_ interested in a function body, but that's examined by an opaque component which I don't control. – einpoklum Jul 18 '22 at 14:52
  • Crazy idea: what if you compile it as C++ (the C code should be compilable) and extract symbols from the object files? Function names will be decorated with argument types. – dimich Jul 18 '22 at 15:06
  • @dimich There are some C constructs that are not valid C++. See https://stackoverflow.com/questions/861517/what-issues-can-i-expect-compiling-c-code-with-a-c-compiler – Andrew Henle Jul 18 '22 at 15:15
  • @dimich: Why as C++? For the mangling of the full signature you mean? – einpoklum Jul 18 '22 at 16:06
  • @AndrewHenle: Well, yes, but not so much in function signatures. – einpoklum Jul 18 '22 at 16:07
  • @einpoklum Yes, for C++ name mangling. Of course it won't work for static functions. Hm, extract from debug info? Anyway, the idea is common for most answers here - use existing parser. – dimich Jul 18 '22 at 16:21
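
The approach suggested in the comments (skip everything inside `{}`, split on semicolons, then parse what is left with a small hand-written parser) could be sketched roughly as below. This is only a minimal sketch in C++, not an established tool. The names `Signature`, `blank_noise`, `trim`, `parse_candidate` and `find_signatures` are hypothetical and exist only for this illustration. It assumes "plain vanilla" code as described in the question: no function pointers, no attributes or macros wrapped around the declarator, and no K&R-style definitions. Anything it does not recognize is silently skipped.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical result type, one per top-level prototype or definition found.
struct Signature {
    std::string return_type;          // everything before the name, e.g. "static int"
    std::string name;                 // e.g. "add"
    std::vector<std::string> params;  // raw parameter texts, e.g. {"int a", "int b"}
    bool has_body = false;            // true for definitions, false for prototypes
};

// Step 1: overwrite comments, literals and preprocessor lines with spaces so
// that braces, semicolons and parentheses inside them cannot mislead the scan.
static std::string blank_noise(std::string s) {
    bool line_start = true;  // only whitespace seen so far on the current line
    for (std::size_t i = 0; i < s.size();) {
        char c = s[i];
        bool two = i + 1 < s.size();
        if (c == '\n') { line_start = true; ++i; }
        else if ((line_start && c == '#') || (two && c == '/' && s[i + 1] == '/')) {
            while (i < s.size() && s[i] != '\n') {  // directive or line comment
                bool spliced = s[i] == '\\' && i + 1 < s.size() && s[i + 1] == '\n';
                s[i] = ' ';
                i += spliced ? 2 : 1;               // follow a backslash-newline
            }
        } else if (two && c == '/' && s[i + 1] == '*') {  // block comment
            std::size_t end = s.find("*/", i + 2);
            end = (end == std::string::npos) ? s.size() : end + 2;
            for (; i < end; ++i) if (s[i] != '\n') s[i] = ' ';
        } else if (c == '"' || c == '\'') {               // string or char literal
            s[i++] = ' ';
            while (i < s.size() && s[i] != c && s[i] != '\n') {
                if (s[i] == '\\' && i + 1 < s.size()) s[i++] = ' ';
                s[i++] = ' ';
            }
            if (i < s.size() && s[i] == c) s[i++] = ' ';
            line_start = false;
        } else {
            if (!std::isspace(static_cast<unsigned char>(c))) line_start = false;
            ++i;
        }
    }
    return s;
}

static std::string trim(const std::string& s) {
    std::size_t b = s.find_first_not_of(" \t\r\n"), e = s.find_last_not_of(" \t\r\n");
    return b == std::string::npos ? "" : s.substr(b, e - b + 1);
}

// Step 2: split "return-type name(params)" into its parts; reject anything else.
static bool parse_candidate(const std::string& text, bool has_body, Signature& sig) {
    std::string s = trim(text);
    if (s.empty() || s.back() != ')') return false;
    std::size_t open = std::string::npos;
    int depth = 0;
    for (std::size_t i = s.size(); i-- > 0;) {  // find '(' matching the final ')'
        if (s[i] == ')') ++depth;
        else if (s[i] == '(' && --depth == 0) { open = i; break; }
    }
    if (open == std::string::npos) return false;
    std::string head = trim(s.substr(0, open));
    // Initializers, attributes, function pointers and typedefs are out of scope.
    if (head.find_first_of("=()") != std::string::npos || head.rfind("typedef", 0) == 0)
        return false;
    std::size_t n = head.size();                // walk back over the identifier
    while (n > 0 && (std::isalnum(static_cast<unsigned char>(head[n - 1])) ||
                     head[n - 1] == '_')) --n;
    sig.name = head.substr(n);
    sig.return_type = trim(head.substr(0, n));
    if (sig.name.empty() || sig.return_type.empty()) return false;
    sig.has_body = has_body;
    std::string cur;
    for (char c : s.substr(open + 1, s.size() - open - 2) + ",") {
        if (c != ',') { cur += c; continue; }   // no nested commas, by assumption
        cur = trim(cur);
        if (!cur.empty() && cur != "void") sig.params.push_back(cur);
        cur.clear();
    }
    return true;
}

// Step 3: walk the file at brace depth 0; every ';' or '{' ends a candidate.
std::vector<Signature> find_signatures(const std::string& source) {
    std::string src = blank_noise(source);
    std::vector<Signature> found;
    std::string stmt;
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c != ';' && c != '{') { stmt += c; continue; }
        Signature sig;
        if (parse_candidate(stmt, c == '{', sig)) found.push_back(sig);
        stmt.clear();
        int depth = (c == '{') ? 1 : 0;
        while (depth > 0 && ++i < src.size())   // skip the body to its matching '}'
            depth += (src[i] == '{') - (src[i] == '}');
    }
    return found;
}
```

Calling `find_signatures` on the text of a file returns one `Signature` per recognized prototype or definition. The `return_type` and `params` fields are kept as raw text, so a richer tree would still need a declarator parser on top (or a tool such as libclang or ctags, as mentioned in the comments).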
