
I have a lot of preprocessor macro definitions, like this:

#define FOO 1
#define BAR 2
#define BAZ 3

In the real application, each definition corresponds to an instruction in an interpreter virtual machine. The macros are also not sequential in numbering to leave space for future instructions; there may be a #define FOO 41, then the next one is #define BAR 64.

I'm now working on a debugger for this virtual machine, and need to effectively 'reverse' these preprocessor macros. In other words, I need a function which takes the number and returns the macro name, e.g. an input of 2 returns "BAR".

Of course, I could create a function using a switch myself:

const char* instruction_by_id(int id) {
    switch (id) {
        case FOO:
            return "FOO";
        case BAR:
            return "BAR";
        case BAZ:
            return "BAZ";
        default:
            return "???";
    }
}

However, this will be a nightmare to maintain, since renaming, removing or adding instructions will require this function to be modified too.

Is there another macro which I can use to create a function like this for me, or is there some other approach? If not, is it possible to create a macro to perform this task?

I'm using gcc 6.3 on Windows 10.

Aaron Christiansen
  • Just checking, is it necessary to do this with preprocessor macros, rather than a data structure of some kind? – David Z Apr 14 '18 at 08:59
  • @DavidZ I'm certainly open to different ways of approaching this; if there's a better way than preprocessor macros then I'd love to use it. – Aaron Christiansen Apr 14 '18 at 09:01
  • I am really surprised you have never been taught about *metaprogramming* approaches (at least with the example of parser generators or [compiler compiler](https://en.wikipedia.org/wiki/Compiler-compiler)). Where have you been taught C? – Basile Starynkevitch Apr 14 '18 at 09:35
  • Follow (when you have time) all the links in my answer. At some point you'll need to know all that. See http://norvig.com/21-days.html – Basile Starynkevitch Apr 14 '18 at 10:45
  • @BasileStarynkevitch Will do, thanks. (On an unrelated note, just thought I'd let you know that the link to GCC MELT in your bio is broken.) – Aaron Christiansen Apr 14 '18 at 10:53
  • Yes. GCC MELT is finished (and I don't pay for the domain any more). But the pages are also on http://starynkevitch.net/Basile/gcc-melt/ ; and I updated my bio – Basile Starynkevitch Apr 14 '18 at 10:54

3 Answers


You have the wrong approach. Read SICP if you have not read it.

I have a lot of preprocessor macro definitions, like this:

#define FOO 1
#define BAR 2
#define BAZ 3

Remember that C or C++ code can be generated, and it is quite easy to instruct your build automation tool to generate some particular C file (with GNU make or ninja you just add some rule or recipe).

For example, you could use some different preprocessor (like GPP or m4), or some script (e.g. in awk, Python, or Guile), or write your own program (in C, C++, OCaml, etc.), to generate the header file containing these #define-s. Another script or program (or the same one, invoked differently) could generate the C code of instruction_by_id.

Such basic metaprogramming techniques (of generating some or several C files from something higher level but specific) have been used since at least the 1980s (e.g. with yacc or RPCGEN). The C preprocessor facilitates that with its #include directive (since you can even include lines inside some function body, etc...). Actually, the idea that code is data (and proof) and data is code is even older (Church-Turing thesis, Curry-Howard correspondence, Halting problem). The Gödel, Escher, Bach book is very entertaining....

For example, you could decide to have a textual file opcodes.txt (or even some sqlite database containing stuff....) like

# ignore lines starting with a hash sign
FOO 1
BAR 2

and have two small awk or Python scripts (or two tiny specialized C programs), one generating the #define-s (into opcode-defines.h) and another generating the body of instruction_by_id (into opcode-instr.inc). Then you need to adapt your Makefile to generate these, and put #include "opcode-defines.h" inside some global header, and have

 const char* instruction_by_id(int id) {
    switch (id) {
 #include "opcode-instr.inc"
    default: return "???";
    }
 }

this will be a nightmare to maintain,

Not so with such a metaprogramming approach. You'll just maintain opcodes.txt and the scripts using it, but you express a given "knowledge element" (the relation of FOO to 1) only once (in a single line of opcodes.txt). Of course you need to document that (at the very least, with comments in your Makefile).
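One of the "tiny specialized C programs" mentioned above might look like the following minimal sketch. The function names format_opcode_line and generate_opcode_files, the buffer sizes, and the 63-character limit on names are assumptions made for illustration; only the file format (one NAME VALUE pair per line, # for comments) comes from the opcodes.txt example.

```c
#include <stdio.h>

/* Hypothetical helper: turn one line of opcodes.txt ("NAME VALUE") into
   a #define line and a case line. Returns 1 on success, 0 for blank
   lines, comment lines, or lines that do not parse. */
int format_opcode_line(const char *line,
                       char *define_out, size_t define_size,
                       char *case_out, size_t case_size)
{
    char name[64];
    int value;

    /* Skip leading whitespace, then ignore comments and blank lines. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '#' || *line == '\n' || *line == '\0')
        return 0;

    if (sscanf(line, "%63s %d", name, &value) != 2)
        return 0;

    snprintf(define_out, define_size, "#define %s %d\n", name, value);
    snprintf(case_out, case_size,
             "    case %s: return \"%s\";\n", name, name);
    return 1;
}

/* Read opcodes from `in`; write the #define-s to `defines` and the case
   labels to `cases`. Returns the number of opcodes emitted. */
int generate_opcode_files(FILE *in, FILE *defines, FILE *cases)
{
    char line[256], define_buf[128], case_buf[192];
    int count = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        if (format_opcode_line(line, define_buf, sizeof define_buf,
                               case_buf, sizeof case_buf)) {
            fputs(define_buf, defines);
            fputs(case_buf, cases);
            count++;
        }
    }
    return count;
}
```

A small main would open opcodes.txt for reading and opcode-defines.h and opcode-instr.inc for writing, call generate_opcode_files, and be run from a Makefile rule that lists opcodes.txt as a prerequisite of both generated files.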

Metaprogramming from some higher-level, declarative formalization is a very powerful paradigm. In France, J. Pitrat has pioneered it since the 1960s (and, though retired, he writes an interesting blog today). In the US, J. McCarthy and the Lisp community did likewise.

For an entertaining talk, see Liam Proven's FOSDEM 2018 talk, The circuit less traveled.

Large software projects use that metaprogramming approach quite often. For example, the GCC compiler has about a dozen C++ code generators (in total, they emit more than a million lines of C++).

Another way of looking at such an approach is the idea of domain-specific languages that could be compiled to C. If you use an operating system providing dynamic loading, you can even write a program emitting C code, forking a process to compile it into some plugin, then loading that plugin (on POSIX or Linux, with dlopen). Interestingly, computers are now fast enough to enable such an approach in an interactive application (in some sort of REPL): you can emit a C file of a few thousand lines, compile it into some .so shared object file, and dlopen that, in a fraction of second. You could also use JIT-compiling libraries like GCCJIT or LLVM to generate code at runtime. You could embed an interpreter (like Lua or Guile) into your program.

BTW, metaprogramming approaches are one of the reasons why basic compilation techniques should be known by most developers (and not just by people in the compiler business); another reason is that parsing problems are very common. So read the Dragon Book.

Be aware of Greenspun's tenth rule. It is much more than a joke, actually a profound truth about large software.

Basile Starynkevitch
  • This is really fantastic. Thank you very much for such a detailed, comprehensive answer. As for your comment on my question: my C is self-taught, and a metaprogramming approach is kind of what I was looking for. (I mentioned in my question the idea of creating a custom macro.) I have done plenty of metaprogramming in dynamic languages (mainly Ruby) but never in C, and it didn't occur to me that I could `#include` anywhere (though it makes a lot of sense, since my understanding of `#include` is that it's basically a copy-paste). Thank you once again! – Aaron Christiansen Apr 14 '18 at 09:46
  • SICP link is broken (404). Here is archived page: http://web.archive.org/web/20230623125309/http://mitpress.mit.edu/9780262510875/structure-and-interpretation-of-computer-programs/ – jacobq Jul 18 '23 at 12:16

In a similar case I've resorted to defining a text file format that defines the instructions, and writing a program to read this file and write out the C source of the actual instruction definitions and the C source of functions like your instruction_by_id(). This way you only need to maintain the text file.

dmuir

As awesome as general code generation is, I’m surprised that nobody mentioned that (if you relax your problem definition just a bit) the C preprocessor is perfectly capable of generating the necessary code, using a technique called X macros. In fact every simple bytecode VM in C that I’ve seen uses this approach.

The technique works as follows. First, there is a file (call it insns.h) containing the authoritative list of instructions,

INSN(FOO, 1)
INSN(BAR, 2)
INSN(BAZ, 3)

or alternatively a macro in some other header containing the same,

#define INSNS \
  INSN(FOO, 1) \
  INSN(BAR, 2) \
  INSN(BAZ, 3)

whichever is more convenient for you. (I’ll use the first option in the following.) Note that INSN is not defined anywhere. (Traditionally it would be called X, thus the name of the technique.) Wherever you want to loop over your instructions, define INSN to generate the code you want, include insns.h, then undefine INSN again.

In your disassembler, write

const char *instruction_by_id(int id) {
    switch (id) {
#define INSN(NAME, VALUE) \
    case NAME: return #NAME;
#include "insns.h" /* or just INSNS if you use a macro */ 
#undef INSN
    default: return "???";
    }
}

using the prefix stringification operator # to turn names-as-identifiers into names-as-string-literals.

You obviously can’t define the constants this way, because macros cannot define other macros in the C preprocessor. However, if you don’t insist that the instruction constants be preprocessor constants, there’s a different perfectly serviceable constant facility in the C language: enumerations. Whether or not you use an enumerated type, the enumerators defined inside it are regular integer constants from the point of view of the compiler (though not the preprocessor—you cannot use #ifdef with them, for example). So, using an anonymous enumeration type, define your constants like this:

enum {
#define INSN(NAME, VALUE) \
    NAME = VALUE,
#include "insns.h" /* or just INSNS if you use a macro */
#undef INSN
    NINSNS /* C89 doesn’t allow trailing commas in enumerations (but C99+ does), and you may find this constant useful in any case */
};
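For readers who want to try the technique in a single file, here is a minimal sketch that combines the enumeration with the lookup function. It uses the list-macro variant (INSNS) instead of a separate insns.h, and the numeric values are illustrative, chosen to be non-sequential as in the question.

```c
/* Authoritative instruction list; the values are illustrative. */
#define INSNS \
    INSN(FOO, 41) \
    INSN(BAR, 64) \
    INSN(BAZ, 65)

/* First expansion: define the constants as enumerators. */
enum {
#define INSN(NAME, VALUE) NAME = VALUE,
    INSNS
#undef INSN
    NINSNS /* with explicit values this is just "last value + 1" */
};

/* Second expansion: one case label per instruction. */
const char *instruction_by_id(int id)
{
    switch (id) {
#define INSN(NAME, VALUE) case NAME: return #NAME;
    INSNS
#undef INSN
    default: return "???";
    }
}
```

Adding, renaming, or removing an instruction then means touching only the INSNS list; both expansions follow automatically.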

If you want to statically initialize an array indexed by your bytecodes, you’ll have to use C99 designated initializers {[FOO] = foovalue, [BAR] = barvalue, /* ... */} whether or not you use X macros. However, if you don’t insist on assigning custom codes to your instructions, you can eliminate VALUE from the above and have the enumeration assign consecutive codes automatically, and then the array can be simply initialized in order, {foovalue, barvalue, /* ... */}. As a bonus, NINSNS above then becomes equal to the number of the instructions and the size of any such array, which is why I called it that.
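The designated-initializer case can also be generated from the same list. The following is a sketch under assumptions: the array insn_names and the function insn_name are names invented here, the values are illustrative, and the code requires C99 or later. Slots for unassigned codes are left as null pointers, so the lookup has to check for them.

```c
#include <stddef.h>

/* Illustrative list with a gap between BAR and BAZ. */
#define INSNS \
    INSN(FOO, 1) \
    INSN(BAR, 2) \
    INSN(BAZ, 5)

enum {
#define INSN(NAME, VALUE) NAME = VALUE,
    INSNS
#undef INSN
    NINSNS /* here: 6, i.e. one past the largest code listed last */
};

/* Table indexed by opcode; entries not named in INSNS stay NULL. */
static const char *const insn_names[NINSNS] = {
#define INSN(NAME, VALUE) [NAME] = #NAME,
    INSNS
#undef INSN
};

const char *insn_name(int id)
{
    if (id < 0 || id >= NINSNS || insn_names[id] == NULL)
        return "???";
    return insn_names[id];
}
```

Note that sizing the array with NINSNS only works if the instruction with the largest code comes last in the list; otherwise a separate maximum would be needed.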

There are more tricks you can use here. For example, if some instructions have variants for several data types, the instruction list X macro can call the type list X macro to generate the variants automatically. (The somewhat ugly second option of storing the X macro list in a large macro and not an include file may be more handy here.) The INSN macro may take additional arguments such as the mode name, which would be ignored in the code list but used to call the appropriate decoding routine in the disassembler. You can use the token pasting operator ## to add prefixes to the names of the constants, as in INSN_ ## NAME to generate INSN_FOO, INSN_BAR, etc. And so on.
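Two of these tricks, an extra argument and a pasted prefix, can be sketched as follows. The mode names (NONE, REG, IMM), the enum mode type, and the functions insn_name and insn_mode are all hypothetical, invented for this illustration.

```c
/* Each entry carries a third, hypothetical argument: the operand mode. */
#define INSNS \
    INSN(FOO, 1, NONE) \
    INSN(BAR, 2, REG) \
    INSN(BAZ, 3, IMM)

enum mode { MODE_NONE, MODE_REG, MODE_IMM };

/* The code list ignores MODE and pastes an INSN_ prefix onto each name. */
enum {
#define INSN(NAME, VALUE, MODE) INSN_##NAME = VALUE,
    INSNS
#undef INSN
    NINSNS
};

/* The name lookup stringifies the bare name, without the prefix. */
const char *insn_name(int id)
{
    switch (id) {
#define INSN(NAME, VALUE, MODE) case INSN_##NAME: return #NAME;
    INSNS
#undef INSN
    default: return "???";
    }
}

/* The mode lookup uses the third argument and ignores VALUE. */
enum mode insn_mode(int id)
{
    switch (id) {
#define INSN(NAME, VALUE, MODE) case INSN_##NAME: return MODE_##MODE;
    INSNS
#undef INSN
    default: return MODE_NONE;
    }
}
```

A disassembler could switch on the result of insn_mode to pick the decoding routine for the operands, while the prefix keeps short instruction names from colliding with other identifiers.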

Alex Shpilkin
  • Wow, this is really cool! I've never encountered X macros before but I'm glad I know about them now, and they make for a really nifty solution here. Thank you for adding such a detailed answer :) – Aaron Christiansen Jan 18 '21 at 16:22