I find that I cannot use 😃 (U+1F603) as a valid identifier with g++ 4.7, even with the -fextended-identifiers option enabled:

int main(int argc, const char* argv[])
{
  const char* 😃 = "I'm very happy";
  return 0;
}

main.cpp:3:3: error: stray ‘\360’ in program
main.cpp:3:3: error: stray ‘\237’ in program
main.cpp:3:3: error: stray ‘\230’ in program
main.cpp:3:3: error: stray ‘\203’ in program
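The four stray bytes in these errors are the octal spelling of a single UTF-8 sequence. As a rough sketch (the helper name `decode_utf8` is hypothetical, and it assumes well-formed input with no validation), they can be decoded back to the code point like this:

```cpp
#include <cstdint>

// Decode one UTF-8 sequence of `len` bytes (1 to 4) into a code point.
// Hypothetical helper for illustration: assumes well-formed input.
std::uint32_t decode_utf8(const unsigned char* s, int len)
{
    if (len == 1)
        return s[0];                       // plain ASCII byte

    // The lead byte carries (7 - len) payload bits: 0x1F, 0x0F or 0x07.
    std::uint32_t cp = s[0] & (0xFFu >> (len + 1));

    // Each continuation byte (10xxxxxx) contributes 6 more bits.
    for (int i = 1; i < len; ++i)
        cp = (cp << 6) | (s[i] & 0x3Fu);

    return cp;
}
```

Passing the bytes `0360, 0237, 0230, 0203` with a length of 4 gives `0x1F603`, which is the same code point as the universal-character-name `\U0001F603`.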

After some googling, I discovered that UTF-8 characters are not yet supported in identifiers, but that a universal-character-name should work. So I converted my source to:

int main(int argc, const char* argv[])
{
  const char* \U0001F603 = "I'm very happy";
  return 0;
}

main.cpp:3:15: error: universal character \U0001F603 is not valid in an identifier

So apparently 😃 isn't a valid identifier character. However, the standard specifically allows characters from the range 10000-1FFFD in Annex E.1 and doesn't disallow it as an initial character in E.2.
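To make that reading of the annex concrete, here is a partial sketch. It covers only the 10000-1FFFD range quoted above, and the E.2 ranges are written from my recollection of the C++11 text, so treat them as an assumption to check against the standard. The function names are hypothetical.

```cpp
#include <cstdint>

// Partial check: only the 10000-1FFFD range from Annex E.1 mentioned above.
// The real annex lists many more ranges.
constexpr bool in_quoted_e1_range(std::uint32_t cp)
{
    return cp >= 0x10000 && cp <= 0x1FFFD;
}

// Annex E.2 (characters disallowed initially), as recalled from C++11:
// 0300-036F, 1DC0-1DFF, 20D0-20FF, FE20-FE2F. Assumption, not verified here.
constexpr bool disallowed_initially(std::uint32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F)
        || (cp >= 0x1DC0 && cp <= 0x1DFF)
        || (cp >= 0x20D0 && cp <= 0x20FF)
        || (cp >= 0xFE20 && cp <= 0xFE2F);
}

// U+1F603 falls in the allowed range and is not excluded as a first character.
static_assert(in_quoted_e1_range(0x1F603), "allowed by the quoted E.1 range");
static_assert(!disallowed_initially(0x1F603), "not excluded by E.2");
```

Under those assumptions the standard permits the identifier, so the rejection comes from the compiler rather than from the language rules.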

My next effort was to see if any other allowed Unicode characters worked, but none that I tried did. Not even the ever-important PILE OF POO (💩) character.

So, for the sake of meaningful and descriptive variable names, what gives? Does -fextended-identifiers do as it advertises or not? Is it only supported in the very latest build? And what kind of support do other compilers have?

Peter Mortensen
Joseph Mansfield
  • Read [this](http://www.learncpp.com/cpp-tutorial/22-keywords-and-naming-identifiers/). – ErikEsTT Oct 02 '12 at 14:43
  • @ErikEsTT Unfortunately that page doesn't mention that an identifier can contain a `universal-character-name`, so whatever advice they give on naming conventions doesn't take into account the importance of using smiley faces as variable names. See §2.11 of ISO/IEC 14882:2011(E). – Joseph Mansfield Oct 02 '12 at 14:49
  • Hmm it seems the program `static const char* x = "I'm very happy";` crashes clang 3.1... – kennytm Oct 02 '12 at 14:51
  • See [this](http://msdn.microsoft.com/en-us/library/53y7f3az) example. – ErikEsTT Oct 02 '12 at 15:17
  • `clang` supports this since `3.3` with no special options but `gcc 4.8.1` still doesn't. Related: http://stackoverflow.com/questions/26660180/unicode-special-characters-in-variable-names-in-clang-not-allowed – alfC Oct 30 '14 at 18:56
  • You can, but you need C++11, and your source should be encoded as Unicode. –  Jul 10 '16 at 05:19

3 Answers

As of 4.8, GCC does not support characters outside the BMP in identifiers, which seems to be an unnecessary restriction. GCC also supports only the very restricted set of characters described in ucnid.tab, which is based on C99 and C++98 (it does not seem to have been updated for C11 and C++11 yet).

As described in the manual, -fextended-identifiers is experimental, so there is a higher chance that it won't work as expected.


GCC has supported the C11 character set since 4.9.0 (SVN r204886, to be precise), so the OP's second piece of code, using \U0001F603, does work. I still can't get the actual code using 😃 to work, even with -finput-charset=UTF-8, on GCC 8.2 at https://gcc.godbolt.org (you may want to follow this bug report, provided by @DanielWolf).

Meanwhile, both pieces of code work on Clang 3.3 without any options other than -std=c++11.

Peter Mortensen
kennytm
  • How about `main.cpp:3:15: error: universal character \u00a8 is not valid in an identifier`? This is with 4.7, though. – Joseph Mansfield Oct 02 '12 at 15:29
  • It [caught up](https://stackoverflow.com/questions/30130806/using-emoji-as-identifier-names-in-c-in-visual-studio-or-gcc/64108334#64108334) with GCC 10 (2020-05-07). This could be updated (but ***[without](https://meta.stackexchange.com/a/131011)*** "Edit:", "Update:", or similar - the answer should appear as if it was written today) – Peter Mortensen Aug 20 '23 at 09:49

This was a known bug in GCC 9 and before. This has been fixed in GCC 10.

The official changelog for GCC 10 contains this section:

Extended characters in identifiers may now be specified directly in the input encoding (UTF-8, by default), in addition to the UCN syntax (\uNNNN or \UNNNNNNNN) that is already supported:

static const int π = 3;
int get_naïve_pi() {
  return π;
}
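As a small sketch of what that changelog entry implies, the UCN spelling and the direct UTF-8 spelling should name the same identifier. This assumes GCC 10 or later (or a recent Clang) and a source file saved as UTF-8; the function name `get_pi` is made up for illustration.

```cpp
// Declared through the UCN spelling of GREEK SMALL LETTER PI (U+03C0).
static const int \u03C0 = 3;

// Referenced through the literal UTF-8 spelling of the same character.
int get_pi()
{
    return π;
}
```

If the compiler accepts both spellings, `get_pi()` returns 3, because `\u03C0` and `π` are one and the same identifier.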
Daniel Wolf

However, the standard specifically allows characters from the range 10000-1FFFD in Annex E.1 and doesn't disallow it as an initial character in E.2.

One thing to keep in mind: the fact that the C++ standard allows (or disallows) some feature does not necessarily mean that your compiler supports (or doesn't support) that feature.

Code-Apprentice
  • Yes, allowing the full set of Unicode characters specified by the standard is one thing that, as far as I know, no compilers support yet, either literally or with UCNs. – bames53 Oct 02 '12 at 15:42
  • Of course! I only meant to find some documentation or source that shows they don't support this feature. – Joseph Mansfield Oct 02 '12 at 15:47
  • @sftrabbit Okay, maybe my answer is pointing out the obvious. KennyTM gave the link re: gcc. – Code-Apprentice Oct 02 '12 at 15:49