While writing a compiler for the C language, I went looking for its grammar and stumbled upon the ANSI C grammar (Lex specification). There I found the following regular expression (RE):

{CP}?"'"([^'\\\n]|{ES})+"'"     { return I_CONSTANT; }

where CP and ES are defined as follows:

CP  (u|U|L)
ES  (\\(['"\?\\abfnrtv]|[0-7]{1,3}|x[a-fA-F0-9]+))

I understand that ES is the RE for an escape sequence.

If I understand the regular expression correctly, then u'123', U'\n\t', and L'abc' are all valid I_CONSTANTs.

I wrote the following small program to see what constant values they represent.

#include <stdio.h>

int main(void) {
    /* Prefixed multicharacter constants; the casts make each argument match %d. */
    printf("%d %d %d\n", (int)u'123', (int)U'\n\t', (int)L'abc');
    return 0;
}

This gave the following output.

51 9 99

I worked out that each one represents the ASCII value of the rightmost character inside the single quotes. What I fail to understand is the use and importance of this kind of integer constant.

Mayank

1 Answer

These are multicharacter literals, and their value is implementation-defined.

From C11 6.4.4.4 p10-11:

The value of an integer character constant containing more than one character (e.g., 'ab'), or containing a character or escape sequence that does not map to a single-byte execution character, is implementation-defined.

...

The value of a wide character constant containing more than one multibyte character or a single multibyte character that maps to multiple members of the extended execution character set, or containing a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.

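From your testing, it looks like GCC ignores all but the rightmost character in prefixed (L, u, or U) multicharacter literals. If you don't specify a prefix, however, GCC combines the characters into a single int, one character per byte: each new character shifts the value accumulated so far left by the width of a char, so 'ab' is ('a' << 8) | 'b' where char is 8 bits wide. The value itself does not depend on endianness; only its byte layout in memory does. Neither behavior should be relied on in portable code.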

interjay