I'm probably going to get roasted by the language lawyers. But here's how I read the spec.
What matters here is the identifier token, which is what the preprocessor works with for #define substitution, as does the compiler itself. String literals are not identifiers. An identifier token is any sequence starting with a letter (a-z, A-Z) or underscore (_), followed by any number of letters, digits, or underscores. In other words, the underscore distinguishes one token from another the same way any other letter or digit would.
Or slightly more formally (though not perfectly), an identifier token is this regex:
[a-zA-Z_]([a-zA-Z0-9_])*
In other words, the underscore is just another letter.
Anything not matched by the regex above is another type of token (string literal, whitespace, operator, parenthesis, etc.).
And to bring it all home.
These expressions:
foo
_foo
->foo
&foo
"foo"
foo456
5foo
Are parsed as:
identifier (foo)
identifier (_foo)
operator (->) followed by identifier (foo)
operator (&) followed by identifier (foo)
string literal ("foo")
identifier (foo456)
invalid identifier: (5foo).
The preprocessor performing replacement for #define foo 123 is going to match identifiers named foo. It will not expand foo456 into 123456, nor will it expand _foo into _123, as those are completely different identifier names.
I'm not 100% positive about 5foo, but I can't see any way that's considered a valid identifier token. Or perhaps more precisely, it's an invalid number token.
Source: section 5.4 of the C++ draft standard (https://isocpp.org/files/papers/N4860.pdf). Yes, I know this question is tagged as C, but the C rules should be close enough.