#define foo 123

foo
_foo
->foo
&foo

Running gcc -E on the above file outputs this:

123
_foo
->123
&123

This shows that _ is not a token separator and that _foo is treated as a single word. Where can I find a list of valid token separators for the C preprocessor?

Dan
  • @paulsm4 It's a little more complicated than that. Tokens need not be separated by whitespace. For example, see if you can identify the tokens in the expression `a+b-f(x,y)`. Hint: There are 10 tokens, but no whitespace. Any of the identifiers `a`, `b`, `f`, `x`, or `y` could be replaced by a macro expansion. – Tom Karzes Aug 12 '23 at 03:37

2 Answers


I'm probably going to get roasted by the language lawyers. But here's how I read the spec.

What matters for #define substitution is the identifier token, which is the same kind of identifier the compiler itself works with. String literals are a separate kind of token and are never searched for macro names. An identifier token is any sequence starting with a letter (a-z, A-Z) or underscore (_) followed by any number of letters, digits, or underscores. So a leading underscore makes a different identifier, the same way any other extra letter or digit would.

Or slightly more formally (though not perfectly), an identifier token matches this regex:

[a-zA-Z_]([a-zA-Z0-9_])*

In other words, the underscore is just another letter.
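
To make the regex concrete, here is a minimal sketch of the same check written by hand in C. The function name `is_basic_identifier` is made up for illustration, and it assumes the default "C" locale and the basic character set only (no universal character names or implementation-defined characters).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s matches [a-zA-Z_][a-zA-Z0-9_]*,
   otherwise 0. Assumes the "C" locale, where isalpha/isalnum cover
   exactly a-z, A-Z and 0-9. */
int is_basic_identifier(const char *s)
{
    /* First character: a letter or an underscore, never a digit. */
    if (s == NULL || !(isalpha((unsigned char)*s) || *s == '_'))
        return 0;

    /* Remaining characters: letters, digits, or underscores. */
    for (++s; *s != '\0'; ++s) {
        if (!(isalnum((unsigned char)*s) || *s == '_'))
            return 0;
    }
    return 1;
}
```

With this sketch, "foo", "_foo" and "foo456" are accepted, while "5foo", "->foo" and the empty string are rejected.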

Anything not matched by the regex above is either some other kind of token (string literal, number, operator, parenthesis, etc.) or whitespace between tokens.

And to bring it all home.

These expressions:

foo
_foo
->foo
&foo
"foo"
foo456
5foo

Are parsed as:

identifier (foo)
identifier (_foo)
operator (->) followed by identifier (foo)
operator (&) followed by identifier (foo)
string literal ("foo")
identifier (foo456)
invalid identifier: (5foo).

When the preprocessor performs replacement for #define foo 123, it matches only identifiers named exactly foo. It will not expand foo456 into 123456, nor will it expand _foo into _123, because those are completely different identifier names.
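
One way to see this from inside a program, rather than with gcc -E, is to stringify the result of macro expansion. This is only a sketch: `STR` and `STR_` are helper macros I am introducing here (the usual two-level stringification trick), and the variable and function names are made up.

```c
#include <stdio.h>

#define foo 123

/* Hypothetical helpers: two levels, so the argument is macro-expanded
   first and only then turned into a string literal by #. */
#define STR_(x) #x
#define STR(x) STR_(x)

const char *const whole_word  = STR(foo);     /* foo is replaced */
const char *const underscored = STR(_foo);    /* different identifier, untouched */
const char *const suffixed    = STR(foo456);  /* different identifier, untouched */

/* Prints each spelling next to what the preprocessor left behind.
   The names inside the format strings are string literals, so they
   are not affected by the macro. */
void print_expansions(void)
{
    printf("foo    becomes %s\n", whole_word);
    printf("_foo   becomes %s\n", underscored);
    printf("foo456 becomes %s\n", suffixed);
}
```

Here `whole_word` holds "123", while `underscored` and `suffixed` keep their original spellings "_foo" and "foo456".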

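I'm not 100% positive about 5foo, but I can't see any way that's considered a valid identifier token. More precisely, it lexes as a single preprocessing number (pp-number) that later fails to convert to a valid numeric constant, so the foo inside it is never replaced.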
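
A small sketch of the 5foo case, again using hypothetical `STR`/`STR_` stringification helpers that are not part of the question. Stringifying keeps the pp-number from ever reaching the compiler proper, where it would be rejected as a bad numeric constant.

```c
#include <stdio.h>

#define foo 123

/* Hypothetical helpers: expand the argument, then stringify it. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* 5foo is a single pp-number token, so there is no separate
   identifier foo for the preprocessor to replace. */
const char *const glued = STR(5foo);

/* With whitespace in between, 5 and foo are two tokens, and the
   identifier foo is replaced as usual. */
const char *const spaced = STR(5 foo);

/* Prints both results for comparison. */
void print_pp_number_demo(void)
{
    printf("glued:  %s\n", glued);
    printf("spaced: %s\n", spaced);
}
```

`glued` comes out as "5foo" unchanged, whereas `spaced` contains 123 in place of foo.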

Source: section 5.4 of the C++ draft standard (https://isocpp.org/files/papers/N4860.pdf). Yes, I know this question is tagged as C, but the rules should match closely enough.

selbie
  • @harper - I believe the `*` applies only to the `([a-zA-Z0-9_])` capture group. Hence, the token must start with `[a-zA-Z_]` and may be followed by 0 or more instances of `[a-zA-Z0-9_]`. I don't think it would match a zero-length token. Parsing is hard. – selbie Aug 12 '23 at 04:37
  • @harper - no problem! – selbie Aug 12 '23 at 04:40
  • Also interesting the `\w` (word) regex includes the very same `[a-zA-Z_]`. – David C. Rankin Aug 12 '23 at 05:22
  • *identifier token, which is what the preprocessor is concerned with.* No. The preprocessor is more than an identifier replacement machine. – n. m. could be an AI Aug 12 '23 at 06:07
  • `5foo` is a `pp-number` https://port70.net/~nsz/c/c11/n1570.html#6.4.8 . It's like a `pp-number (pp-number (pp-number (digit(5) identifier-nondigit(f)) identifier-nondigit(o)) identifier-nondigit(o))`. It's like `5LL` or `5f`, just `foo` is an invalid suffix for an integer constant, but still a `pp-number` preprocessor token. Also, it would be nice to mention that identifiers may contain implementation-defined characters. So `#define π 3.14` is allowed. – KamilCuk Aug 12 '23 at 11:02
1

Macro names can only start with [A-Z], [a-z], or _.

So _foo is seen as a completely different name, just like xfoo would be.

You can read more here: What are the valid characters for macro names?

Karrer