Look at this code snippet:

int a = 0xe+1;

Clang, gcc, icc don't compile this:

t.cpp:1:12: error: invalid suffix '+' on integer constant

MSVC compiles it successfully.

Which compiler is correct? If clang and gcc are correct, why is this happening?

Note: if I add a space before the +, the code compiles. If I change 0xe to 0xf, it compiles too. Maybe this has something to do with exponential notation (like 1.2e+3)?

curiousguy
geza
  • If you mean `0xe + 1`, I believe you must put a space before the `+`. – Justin Mar 28 '18 at 20:28
  • @RemyLebeau: It doesn't look like the e+ should be parsed as scientific notation, though; it looks like it should be parsed as `0xe + 1`, a hexadecimal integer constant plus a decimal integer constant. – user2357112 Mar 28 '18 at 20:28
  • @user2357112 that is what the OP likely *wants*, but that is not how a compiler will *actually* parse it – Remy Lebeau Mar 28 '18 at 20:29
  • @RemyLebeau but the exponent notation for hexadecimal floating-point literals is `p`, not `e`. – SergeyA Mar 28 '18 at 20:29
  • @RemyLebeau: Reading the standard, so far it looks like the compiler *should* parse it the way geza expects. – user2357112 Mar 28 '18 at 20:30
  • I believe the reason why this fails is related to [this answer to a question of mine](https://stackoverflow.com/a/49045039/1896169) or [this question](https://stackoverflow.com/q/38091427/1896169) – Justin Mar 28 '18 at 20:30
  • @Justin: Looks like the preprocessing-number thing answers it: the preprocessor tokenization rules are just kind of weird in a way that doesn't quite line up with the normal grammar. – user2357112 Mar 28 '18 at 20:34
  • [This note](http://eel.is/c++draft/lex.pptoken#4) seems pretty relevant. – Barry Mar 29 '18 at 02:57

1 Answer

0xe+1 is treated as a single "preprocessing number" preprocessing token. This tokenization rule doesn't quite line up with the definition of numeric literals in the ordinary grammar; preprocessing numbers are defined as

pp-number:
    digit
    . digit
    pp-number digit
    pp-number identifier-nondigit
    pp-number ' digit
    pp-number ' nondigit
    pp-number e sign
    pp-number E sign
    pp-number p sign
    pp-number P sign
    pp-number .

If the tokenization rules were based on the numeric literal definitions instead of the simpler "preprocessing number" definition, your expression would be tokenized as 0xe + 1, but since the rules don't match up, you get a single 0xe+1 token, which is not a valid literal.

user2357112
  • The section mentions: *Preprocessing number tokens lexically include all integer literal tokens and all floating literal tokens.* 0xe is a valid integer literal token, so why does the compiler accept 11+1 but not 0xe+1? – Jean-Baptiste Yunès Mar 28 '18 at 20:50
  • @Jean-BaptisteYunès: There's no way to interpret `11+1` as a single preprocessing number; `+` can only follow an `e`, `E`, `p`, or `P` in a preprocessing number. – user2357112 Mar 28 '18 at 20:52
  • Oh, I see! That means the pp-number definition covers integer and floating literal tokens but is more flexible and accepts slightly more than the valid literals... My problem is that the given pp-number definition doesn't seem to accept 0x as a prefix; did I miss something? The identifier-nondigit derivation, of course! This definition of pp-number is very, very strange... – Jean-Baptiste Yunès Mar 28 '18 at 21:05
  • I've put a lot of thought into this because it leads to some surprising behavior in some cases. "Fixing" this in the preprocessing grammar is way more work than the benefit it brings, and it could have a notable negative impact on compile times. – Justin Mar 28 '18 at 21:48
  • @Justin: why? Parsing (just parsing, without value computation) integers/floats is easy and fast. We could have a separate pp-integer and pp-float. Sure, it doesn't bring too much benefit, but the current behavior is weird. Presumably the standard already defines how to parse an integer or float, so the preprocessor could use the same definitions, so in the end, the standard would be a little bit simpler (there would be no need for pp-number). – geza Mar 29 '18 at 05:47
  • @geza One thing to note is how this plays out with UDLs; you can't write things like `123_myUdl.bar`. So I was investigating a solution for all cases. I thought about what possible change in the PP grammar would fix it, and everything I thought of got complicated fast. It may be that I'm just not very good at dealing with grammars, though, and that there may be an easy solution. At any rate, I stopped investigating, because the benefit is not worth the amount of effort I would have to put in. If you are willing to put in the effort to rework the grammar, you might consider making a proposal. – Justin Mar 29 '18 at 07:40
  • This behavior is commonly referred to as "maximal munch" (though "greedy lexing" would be clearer IMO). Also see [this answer](https://stackoverflow.com/questions/47467296/why-do-user-defined-string-literals-and-integer-literals-have-different-behavior). – Arne Vogel Mar 29 '18 at 09:36
  • The `0x` prefix is covered by the `pp-number identifier-nondigit` rule, which lets arbitrary alphabetic characters and `_` appear anywhere within a pp-number (given this, I find it unfortunate that cosmetic intra-number separators were added to the language using `'` instead of `_`). – zwol Mar 29 '18 at 23:48