C/C++ macro nested replacement logic

Question

I'm trying to implement C/C++-compatible macro processing. I can correctly handle many corner cases, including those discussed here: Understanding the behavior of C's preprocessor when a macro indirectly expands itself.

However, there's a corner case where I get a different answer from gcc and clang, so am obviously wrong. The code is similar to the case discussed here: Can we have recursive macros?

#define ID(arg) arg
#define EMPTY
#define NOEXPAND(macro) macro EMPTY
#define F_AGAIN() F
#define F() f NOEXPAND(F_AGAIN)()()

ID(F())

I get f F (), but gcc/clang output f f F_AGAIN ()(). My logic is that every token in f NOEXPAND(F_AGAIN)()() has F in its "hide set", and that you can't ever remove an element from a token's hide set.

In Can we have recursive macros?, the most relevant answer talks about painting macros blue rather than hide sets, but I can't find any kind of complete description of how this works. Hence, I am following Dave Prosser's algorithm, which explains cpp behavior in terms of hide sets. I thought that was what compilers did in practice.

My question: am I misinterpreting Prosser's algorithm, or should I be implementing a different algorithm? Is the behavior of modern C preprocessors documented anywhere? The language specification itself is overly vague on this question.

I would have to look at this in more detail, but consider that in processing `ID(F())`, first the macros in the argument to `ID` are replaced. So `F()` is processed. Once that processing is done, it is over, complete, finis. It is not remembered for later suppression. It is like a recursive function has been called and returned. It is no longer nested inside anything. Then `ID(f F_AGAIN ()())` is processed anew. — Eric Postpischil, Jun 05 '21 at 09:26
@EricPostpischil That's a priori logical, but happens to match neither what gcc/clang do, nor Prosser's algorithm. You can do `#define G() g NOEXPAND(G)()` and if your logic held, `ID(G())` should be `g g G()`, but it's only `g G()`. — user3188445, Jun 05 '21 at 09:30
@user3188445 The preprocessor does plain, straightforward text replacement, nothing more nothing less. What's seen 1st is replaced 1st, that's it. There's nothing special about that, no more _"logic"_ than that. — πάντα ῥεῖ, Jun 05 '21 at 09:37
@πάνταῥεῖ: Preprocessor replacement does **not** do “plain, straightforward text replacement.” First, it operates on preprocessor tokens, not text. Second, it has rules about the order of its operations and when replacement tokens are and are not considered for further replacement. Third, it has operators (`#` and `##`). I saw you incorrectly closed this user’s previous question as a duplicate. Please do not do that. — Eric Postpischil, Jun 05 '21 at 09:49
@πάνταῥεῖ MSVC notoriously had a bugged preprocessor for years, so it's not that simple. By "bugged" I mean minor details like the one in question, which nontheless broke complex macros. — HolyBlackCat, Jun 05 '21 at 09:55
@user3188445 *"can't find any kind of complete description of how this works"* Check [the standard](http://eel.is/c++draft/cpp.replace)? — HolyBlackCat, Jun 05 '21 at 09:58
@HolyBlackCat That's exactly where I started, but it turns out the standard is intentionally defective on this point because they were worried about affecting the compliance status of existing compilers. If the standard specified the behavior, I probably wouldn't be asking about it here. Don't believe me? See this: http://open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#268 (but note I handle the case in that discussion, which seems in line with my reading of Prosser). — user3188445, Jun 05 '21 at 10:13
*"committee's decision was that no realistic programs "in the wild" would venture into this area"* Ha! — HolyBlackCat, Jun 05 '21 at 10:16

score 1 · Answer 1 · answered Jun 05 '21 at 22:20

Okay, I think I've figured out what was going on. Compilers do not implement Prosser's algorithm, period. Even though Prosser's algorithm does resolve the particular known defect in the language specification, it doesn't explain the way actual compilers expand macros.

Instead of hide sets, there seem to be two kinds of disabling bits in play: a per-macro disabling bit, and a per-token disabling bit.

The per-macro disabling bit for a macro M is set during the the rescan phase whenever M is being expanded, and is cleared once expansion of M is complete.

If at any point while M is disabled the preprocessor processes an instance of M (whether or not syntactically valid for expansion), it marks that particular M token as disabled. Not only does the compiler avoid expanding a token when it disables the token, it also permanently disables any consideration of the token for expansion, in any context, even after M's expansion has finished.

So let's look at some even simpler examples:

#define ID(arg) arg
#define LPAREN (

#define F_AGAIN() F
#define F() f F_AGAIN LPAREN)()

F()         // => f F_AGAIN ()()
ID(F())     // => f f F_AGAIN ()()
ID(ID(F())) // => f f f F_AGAIN ()()

#define G() g G LPAREN)()

G()         // => g G ()()
ID(G())     // => g G ()()
ID(ID(G())) // => g G ()()

First consider what is happening in the expansion of G(). It expands to the token list g G LPAREN)() and during the re-scan, the G in that token list is permanently disabled. Now no matter how many times you re-scan the token list as a result of passing it through ID, the G can never be expanded.

Next consider what is happening with F(). It expands to the token list f F_AGAIN LPAREN)(). During the rescan, this gets expanded to f F_AGAIN ()(). Because F_AGAIN is not currently a disabled macro, none of these output tokens gets disabled. So now in any re-scan of the ID macro, F_AGAIN will get expanded once, also causing F to be expanded once.

With that context, it's actually possible to make some sense of the language spec:

If the name of the macro being replaced is found during this scan of the replacement list (not including the rest of the source file's preprocessing tokens), it is not replaced. Furthermore, if any nested replacements encounter the name of the macro being replaced, it is not replaced. These nonreplaced macro name preprocessing tokens are no longer available for further replacement even if they are later (re)examined in contexts in which that macro name preprocessing token would otherwise have been replaced.

The part that I guess messed with my intuition is that "it is not replaced" sounds so innocuous--particularly in contexts where the macro would not be replaced anyway, for instance because it's a function-like macro not followed by an open paren (. Then the passive voice in "tokens are no longer available for further replacement" makes it sound like it's just describing a consequence of the previous sentence, when really the spec means, "The compiler actively poisons that token and prohibits it from ever being expanded again."

C/C++ macro nested replacement logic

1 Answers1

Linked