Preprocessing Tokens: '- -' vs. '--'

Question

Why does the (GCC) preprocessor create two tokens - -B instead of a single one --B in the following example? What is the logic that the former should be correct and not the latter?

#define A -B
-A

Output according to gcc -E:

- -B

After all, -- is a valid operator, so theoretically a valid token as well.

Is this specific to the GCC preprocessor or does this follow from the C standards?

@ThomasJager A snippet taken from the IOCCC is a horrible duplicate. It is only borderline on-topic here to begin with. — Lundin, May 30 '18 at 14:19

Petr Skocik · Answer 1 · 2018-05-30T16:27:31.103

The preprocessor works on tokens, not strings. Macro substitution without ## cannot create a new token, so when the preprocessor's output goes to a text file instead of straight into the compiler, preprocessors insert white space so that the text file can be used as C input again without changing its meaning.

The space insertion doesn't seem to be in the standard, but then the standard describes the preprocessor as working on tokens and as feeding its output to the compiler proper, not to a text file.
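To make the "without ##" point concrete, here is a minimal sketch contrasting plain macro substitution with token pasting. The names MINUS, PASTE, without_paste and with_paste are made up for this illustration and do not come from the question.

```c
/* Plain substitution: MINUS expands to its own '-' token. */
#define MINUS -

/* Token pasting: ## joins its two operands into one new token. */
#define PASTE(a, b) a##b

/* Tokens after expansion: '-' '-' 'x', i.e. -(-x), which equals x. */
int without_paste(int x)
{
    return -MINUS x;
}

/* Tokens after expansion: '--' 'x', i.e. a pre-decrement of x. */
int with_paste(int x)
{
    return PASTE(-, -)x;
}
```

Calling without_paste(5) gives back 5, while with_paste(5) gives 4, because only the ## version ever forms a `--` token.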

edited May 30 '18 at 16:27

answered May 30 '18 at 13:31

Petr Skocik


I understand that `gcc -E` might show different output than what is actually sent to compiler. However, if you look at the program I posted, it seems `A` wasn't replaced by `--4` (there would be a compiler error) but by something else such as `- -4`. Why is it the latter and not the former? – bjorn93 May 30 '18 at 13:41
@bjorn93 It's the latter because the mighty standard sayeth "there shall be white-space." :) It's all in the answer. – Petr Skocik May 30 '18 at 13:43
Doesn't that part of the standard concern the definition of object-like macros? In other words, such macros should be defined as `#define identifier replacement_list` with space between identifier and replacement_list. This is not the white space I'm talking about. – bjorn93 May 30 '18 at 13:58
@bjorn93 My bad. I'll see if I can find the proper paragraph. – Petr Skocik May 30 '18 at 14:05
Inserting the space is proper because it properly represents the result of preprocessing. If `gcc -E` produced `--B`, then compiling that would produce the tokens `--` and `B`. That is different than the true result of the original preprocessing, which is the tokens `-`, `-`, and `B`. So the space is needed to properly portray the result of preprocessing. – Eric Postpischil May 30 '18 at 16:28
@EricPostpischil Is tokenization unambiguously (for the most part) specified in the C standards? How do we guarantee this is the correct way to separate into tokens? – bjorn93 May 30 '18 at 17:46
@bjorn93 Yes. Tokenization is, for the most part, well specified in the C standards. http://port70.net/~nsz/c/c11/n1570.html#6.4p3 Whitespace, unless inside a string or character literal, cannot be part of a token and is unambiguously a way to separate tokens. But realize that the preprocessor will normally turn segments of text that qualify as tokens into integers (normally also called tokens), which it will pass directly to the compiler (this implementation isn't mandated by the standard, it's just usual practice). Thus `--` will correspond to one integer and `-` will correspond to another. – Petr Skocik May 30 '18 at 18:06
@bjorn93 When pasting the integer for '-' next to another, the preprocessor will simply pass 2 integers to the compiler proper (instead of 1 which it would pass for `--`). It's only when the preprocessor is asked to generate text that it will need to deal with the space insertion part (at least if it wants to make your life easier). – Petr Skocik May 30 '18 at 18:06
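The comments above rely on the longest-match rule: when text is re-read, `--` is taken as one token unless white space splits it. A toy sketch of that rule follows. The function count_tokens is hypothetical; it only knows `--`, identifiers/numbers, and single characters, and it is nowhere near a real C lexer.

```c
#include <ctype.h>
#include <stddef.h>

/* Toy lexer (illustration only): counts tokens in s using longest match. */
size_t count_tokens(const char *s)
{
    size_t n = 0;

    while (*s != '\0') {
        if (isspace((unsigned char)*s)) {
            s++;                        /* white space only separates tokens */
        } else if (s[0] == '-' && s[1] == '-') {
            s += 2;                     /* longest match: "--" is one token */
            n++;
        } else if (isalnum((unsigned char)*s) || *s == '_') {
            while (isalnum((unsigned char)*s) || *s == '_')
                s++;                    /* identifier or number */
            n++;
        } else {
            s++;                        /* any other single character */
            n++;
        }
    }
    return n;
}
```

With this sketch, count_tokens("--B") is 2 and count_tokens("- -B") is 3, which is why the text output needs the space to round-trip to the same three tokens.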

score 1 · Accepted Answer · answered May 30 '18 at 16:49

Focusing on the white space insertion is missing the issue.

The macro A is defined as the sequence of preprocessing tokens - and B.

When the compiler parses a fragment of source code -A, it produces 2 tokens - and A. A is expanded as part of the preprocessing phase and the tokens are converted to C tokens: -, - and B.

If B is itself defined as a macro (#define B 4), -A would expand to -, -, 4, which is parsed as an expression evaluating to the value 4 with type int.

gcc -E produces text. For the text to convert back to the same sequence of tokens as the original source code, a space needs to be inserted between the two - tokens to prevent -- from being parsed as a single token.
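A minimal sketch of the case described above, using the macros from the question plus #define B 4. The function name minus_a is made up for this illustration.

```c
#define B 4
#define A -B

/*
 * -A expands to the three tokens '-' '-' '4', i.e. -(-4).
 * If the expansion produced the single token '--' instead, this would
 * not compile, because '--' cannot be applied to the constant 4.
 */
int minus_a(void)
{
    return -A;
}
```

Under these definitions minus_a() returns 4, and the fact that it compiles at all shows that the two - tokens stay separate.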

answered May 30 '18 at 16:49

chqrlie


So, the space in the output is needed merely to separate the tokens. Why does the `-traditional-cpp` option produce `--B` though? Is the preprocessor not using tokens in that case? – bjorn93 May 30 '18 at 17:28
Older C compilers did not treat this border case correctly. Depending on whether the C preprocessor was a separate utility with its output piped to the compiler or built into the compiler, this behavior could be used to paste tokens together. A non-portable side effect of the implementation. – chqrlie May 31 '18 at 00:01
So is the example we're talking about (in your post) the way preprocessing should be done according to the standards? That is, it's not compiler-dependent but follows from the standards? – bjorn93 May 31 '18 at 01:13
@bjorn93: Yes, this is correct. What I described in my answer follows from the C Standard. The Standard is a lot more precise and covers some other intricate corner cases... Fully understanding the C preprocessor is a lifetime commitment: I wrote several implementations that pass most of the validation tests I could find, and I still learn new tricks regularly. For example, try to explain why `printf("%x\n", 0x2e+1);` produces a syntax error instead of printing `2f`. – chqrlie May 31 '18 at 13:13
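A hedged note on that last puzzle, as far as I understand the grammar: a preprocessing number may contain `e+`, so `0x2e+1` is read as one preprocessing token, and that token is not a valid integer constant. Separating the pieces with white space gives three tokens again. The function name hex_sum_spaced below is made up for this sketch.

```c
/*
 * "0x2e+1" (no spaces) is a single preprocessing number and is rejected.
 * "0x2e + 1" is the three tokens '0x2e' '+' '1'.
 */
int hex_sum_spaced(void)
{
    return 0x2e + 1;
}
```

Here hex_sum_spaced() returns 0x2f, the value the unspaced version was presumably meant to print.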
