
Reading through the C++17 standard, it seems to me that there is an inconsistency between pp-number as handled by the preprocessor and numeric literals, e.g. user-defined-integer-literal, as they are defined to be handled by the "upper" language.

For example, the following is correctly parsed as a pp-number according to the preprocessor grammar:

123_e+1

But placed in the context of a C++11-compliant code fragment,

int  operator"" _e(unsigned long long)
    { return 0; }

int test()
    {
    return 123_e+1;
    }

current Clang and GCC compilers (I haven't tested others) will report an error similar to this:

unable to find numeric literal operator 'operator""_e+1'

where operator"" _e(...) is not found and trying to define operator"" _e+1(...) would be invalid.

It seems that this comes about because the compiler lexes the token as a pp-number first, but then fails to roll back and apply the grammar rules for a user-defined-integer-literal when parsing the final expression.

In comparison, the following code compiles fine:

int  operator"" _d(unsigned long long)
    { return 0; }

int test()
    {
    return 0x123_d+1;  // doesn't lex as a 'pp-number' because 'sign' can only follow [eEpP]
    }

Is this a correct reading of the standard? And if so, is it reasonable that the compiler should handle this, arguably rare, corner case?

Andy G
  • For the record, MSVC compiles both cases fine. – DeiDei Dec 11 '18 at 11:43
  • There's no underscore allowed in [lex.ppnumber](http://eel.is/c++draft/lex.ppnumber)! So parsing `123_e+1` as `pp-number` is wrong... – Aconcagua Dec 11 '18 at 12:17
  • @Aconcagua - actually _`pp-number`_ lexes _`identifier-nondigit`_ and _`nondigit`_ includes underscore. – Andy G Dec 11 '18 at 12:21
  • This isn't an answer because I'm not 100% sure that it's true for C++17, but similar corner cases exist in C and the interpretation is that the standard actually _requires_ it to be an error: each pp-token is to be converted to one and only one phase-7 token, the compiler is not allowed to "roll back" as you put it. – zwol Dec 11 '18 at 12:25
  • @AndyG Oh, you seem to be right, sorry - but `+` is not covered? Interesting: p/P are included, but the 0x prefix isn't either... – Aconcagua Dec 11 '18 at 12:28
  • My first instinct is that this is a [maximal munch case like >=](https://stackoverflow.com/a/28354898/1708801) and [a+++++b](https://stackoverflow.com/a/24947922/1708801) CC @zwol – Shafik Yaghmour Dec 11 '18 at 13:55
  • @DeiDei MSVC is not correct on this one. – Shafik Yaghmour Dec 11 '18 at 17:56
  • @Aconcagua `+` is covered by the `pp-number e sign` production, which is hard to spot at first and `0x` is covered by `pp-number identifier-nondigit`. – Shafik Yaghmour Dec 11 '18 at 18:01
  • @ShafikYaghmour Hm, that would require parsing 1.e+E+p+P+e as `pp-number` as well... I'd say these construction rules are too generic then; the next standard might (hopefully) get more precise (having two paths: allowing `e sign` only if no identifier has occurred yet, and vice versa)... – Aconcagua Dec 11 '18 at 18:09
  • @Aconcagua it will be ill-formed at later stages, the grammar does not need to catch it here. – Shafik Yaghmour Dec 11 '18 at 18:12
  • @ShafikYaghmour If it did, we wouldn't need spaces, and then the `e` and `d` examples would not need to be treated differently. Interesting example, though, of how good formatting can prevent errors... – Aconcagua Dec 11 '18 at 18:15
  • Interesting to note that a similar `123_p+1` gets inconsistent treatment from clang and gcc [see it live](https://godbolt.org/z/3MsH9R); using that example instead of the `_d` one, and asking why clang and gcc give inconsistent results, seems like a reasonable way to differentiate this question from the duplicate. – Shafik Yaghmour Dec 12 '18 at 18:26

1 Answer


You have fallen victim to the maximal munch rule, which has the lexical analyzer take as many characters as possible to form a valid token.

This is covered in section [lex.pptoken]p3 which says (emphasis mine):

Otherwise, the next preprocessing token is **the longest sequence of characters that could constitute a preprocessing token**, even if that would cause further lexical analysis to fail, except that a header-name ([lex.header]) is only formed within a #include directive.

and includes several examples:

[ Example:

#define R "x"
const char* s = R"y";           // ill-formed raw string, not "x" "y"

— end example ]

[ Example: The program fragment 0xe+foo is parsed as a preprocessing number token (one that is not a valid floating or integer literal token), even though a parse as three preprocessing tokens 0xe, +, and foo might produce a valid expression (for example, if foo were a macro defined as 1). Similarly, the program fragment 1E1 is parsed as a preprocessing number (one that is a valid floating literal token), whether or not E is a macro name. — end example ]

[ Example: The program fragment x+++++y is parsed as x ++ ++ + y, which, if x and y have integral types, violates a constraint on increment operators, even though the parse x ++ + ++ y might yield a correct expression. — end example ]

This rule comes into play in several other well-known cases, such as a+++++b and the >= token, which required a fix to allow.
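
To make those standard examples concrete, here is a minimal sketch (the names foo, x, y and demo1 are just illustrative; the commented-out lines are the ill-formed forms):

#define foo 1

int demo1()
    {
    // int a = 0xe+foo;   // error: "0xe+foo" is a single pp-number, not 0xe + foo
    int a = 0xe + foo;    // fine: three tokens; foo expands to 1, so a == 15

    int x = 0, y = 0;
    // int b = x+++++y;   // error: lexed as x ++ ++ + y; cannot increment an rvalue
    int b = x++ + ++y;    // fine: b == 0 + 1

    return a + b;
    }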

For reference, the pp-number grammar is as follows:

pp-number:  
  digit  
  . digit  
  pp-number digit  
  pp-number identifier-nondigit 
  pp-number ' digit  
  pp-number ' nondigit    
  pp-number e sign  
  pp-number E sign  
  pp-number p sign  
  pp-number P sign  
  pp-number .  

Note the e sign production, which is what is snagging this case. If, on the other hand, you use d, as in your second example, you would not hit this (see it live on godbolt).
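
As a rough sketch of how the grammar plays out (the _e, _p and _d operators below are illustrative, mirroring your examples and the comments above), only the e/E/p/P productions let a sign be absorbed into the pp-number:

int operator"" _e(unsigned long long) { return 0; }
int operator"" _p(unsigned long long) { return 0; }
int operator"" _d(unsigned long long) { return 0; }

int demo2()
    {
    // return 123_e+1;  // one pp-number, via the "pp-number e sign" production
    // return 123_p+1;  // same problem, via the "pp-number p sign" production
    return 123_d+1;     // fine: no production lets '+' follow 'd', so this is 123_d, +, 1
    }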

Adding spacing would also fix your issue, since you would no longer be subject to maximal munch (see it live on godbolt):

123_e + 1
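
Putting it together, a minimal self-contained sketch (reusing the _e operator from your question) of the forms that do and do not tokenize as intended:

int operator"" _e(unsigned long long)
    { return 0; }

int test()
    {
    // return 123_e+1;   // error: the whole of "123_e+1" is lexed as one pp-number
    return 123_e + 1;    // OK: whitespace stops maximal munch
    // return (123_e)+1; // also OK: ')' cannot extend a pp-number
    }
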
Shafik Yaghmour
  • Since you pinged me, again, I haven't close-read the C++ standard in many years and in particular I don't know if there are special rules for user-defined literals. This interpretation is correct for C provided you also take into account the usual interpretation of 5.1.1.2p1#7 "Each preprocessing token is converted into a token" meaning _one and only one_ token -- the compiler isn't supposed to split up a single pp-token into two phase 7 tokens, either, even when that would lead to a valid parse. – zwol Dec 11 '18 at 14:35
  • (Maximal munch _on phase 7 tokens_ would produce `0xe + foo` from example 4, because `0xe` is an integer-literal token and the production for integer-literals does not allow `+` at that point.) – zwol Dec 11 '18 at 14:36
  • Thanks for the answer and for the example from the standard. – Andy G Dec 11 '18 at 16:30
  • @zwol Just following the parser rules, I think `0xe+foo` would need to be accepted as one token - as far as I understand (and learned...), interpreting the token as integer or FP literal is done at a later stage? – Aconcagua Dec 11 '18 at 18:18
  • Now out of curiosity: How is the `x` in `0x` covered? Would it be treated as identifier-nondigit while parsing? – Aconcagua Dec 11 '18 at 18:19
  • @Aconcagua I mean that's the whole question. `0xe+foo` is one _preprocessing_ token. The standard _could_ have been written to specify that lexical analysis is repeated after preprocessing, in which case the pp-token `0xe+foo` would become three phase 7 tokens, the integer-literal `0xe`, the operator `+`, and the identifier `foo`. The C standard wasn't written that way; instead it requires `0xe+foo` be converted to a _single_ phase 7 token, and there is no lexical production that matches, so it's a syntax error. I believe, but am not 100% sure, that the C++ standard is the same way. – zwol Dec 11 '18 at 18:29
  • @zwol: C++ is the same. – rici Dec 15 '18 at 02:05
  • @rici Good to know, thanks. – zwol Dec 16 '18 at 15:33