8

I am trying to use Unicode variable names in g++, but it does not appear to work.

Does g++ not support Unicode variable names? Or is there some supported subset of Unicode that I'm not testing in?

– anon
  • g++ is just not standard-conforming with respect to characters in identifiers. But I don't know of any compiler that *is* conforming. It is my impression that most compilers limit the identifier characters to English A...Z and underscore, plus the $ sign, which is wrong in two ways: not allowing the huge range of Unicode characters specified in Annex E of the standard (I've listed them at http://pastie.org/3110152), and allowing $, which the standard does not allow. In short, the standard and existing practice are very much at odds. Perhaps with C++11... ;-) Cheers & hth., – Cheers and hth. - Alf Jan 02 '12 at 03:33
  • @Cheersandhth.-Alf Try clang :) – Richard Smith Jul 24 '13 at 07:24
  • Possible duplicate: *[ (and other Unicode characters) in identifiers not allowed by g++](https://stackoverflow.com/questions/12692067/)*. – Peter Mortensen May 04 '23 at 21:42
  • "it does not appear to work" Don't hesitate, go ahead and tell us *how* it doesn't work. We've been waiting for 13 years. – n. m. could be an AI May 08 '23 at 22:13
  • @n. m. could be an AI: The OP has left the building: *"Last seen more than 12 years ago"* – Peter Mortensen Aug 20 '23 at 11:24

2 Answers

10

You have to specify the -fextended-identifiers flag when compiling. You also have to write the characters as \uXXXX or \UXXXXXXXX escapes (at least in GCC, these denote Unicode code points).
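
For example, on an older GCC where the flag is not enabled by default, the invocation would look something like this (the file name is just a placeholder):

g++ -fextended-identifiers -c test.cpp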

Identifiers (variable/class names, etc.) in g++ can't be written directly in UTF-8, UTF-16, or any other encoding. They have to match this grammar:

identifier:
  nondigit
  identifier nondigit
  identifier digit

A nondigit is

nondigit: one of
  universal-character-name
  _ a b c d e f g h i j k l m n o p q r s t u v w x y z
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

And a universal-character-name is

universal-character-name:
  \UXXXXXXXX
  \uXXXX

Thus, if you save your source file as UTF-8, you cannot have a variable like:

int høyde = 10;

It has to be written as:

int h\u00F8yde = 10;

(which, in my opinion, defeats the purpose, so just stick with a-z)
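
For instance, a minimal complete program using the escaped spelling might look like this (the compile line assumes an older GCC where -fextended-identifiers is not on by default):

// main.cpp - the identifier "høyde" spelled with universal-character-names
#include <cstdio>

int main()
{
    int h\u00F8yde = 10;             // declares the identifier høyde
    std::printf("%d\n", h\u00F8yde); // prints 10
    return 0;
}

Compiled with g++ -fextended-identifiers main.cpp, this prints 10.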

– nos
  • Is there better support in clang? – anon Apr 21 '10 at 10:27
  • I don't know, but you should ask another question for that. – nos Apr 21 '10 at 11:25
  • g++ is not standard-conforming here (but neither are other compilers, including Comeau). For standard C++, in the very first phase of translation "Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character", and the lexer rules operate on the result of that. In the C++11 standard this is specified in "Phases of translation" §2.2/1 1st list item. – Cheers and hth. - Alf Jan 02 '12 at 03:23
  • @anon Yes, clang allows accented characters in identifiers. – Richard Smith Jul 24 '13 at 07:23
  • @anon **yes**, from clang 3.3 onwards [there is](http://llvm.org/releases/3.3/tools/clang/docs/ReleaseNotes.html#extended-identifiers-unicode-support-and-universal-character-names) support for Unicode identifiers right in UTF-8. – ulidtko Mar 20 '14 at 17:19
  • 9 years later G++ 9.1 is _still_ blind to UTF-8 symbols, even with `-fextended-identifiers -finput-charset=UTF-8`. (For reference, also MSVC++ does fine, either with -utf-8 or with a BOM in the source.) See also: https://stackoverflow.com/a/12693346/1479945 – Sz. Aug 13 '19 at 10:59
4

A one-line patch to the C++ preprocessor allows UTF-8 input. Details for GCC are given in *UTF-8 Identifiers in GCC*.

However, since the preprocessor is shared, the same patch should work for g++ as well. In particular, the patch needed as of gcc-5.2 is:

diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c

Output:

*** gcc-5.2.0/libcpp/charset.c  Mon Jan  5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c  Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;
--- 1711,1717 ----
    struct _cpp_strbuf to;
    unsigned char *buffer;

!   input_cset = init_iconv_desc (pfile, "C99", input_charset);
    if (input_cset.func == convert_no_conversion)
      {
        to.text = input;

Note that for the above patch to work, a recent version of iconv that supports C99 conversions needs to be installed. Run iconv --list to verify this. Otherwise, you can install a new version of iconv along with GCC, as described in the link above.
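
A quick check (assuming a libiconv build, which lists C99 among its encodings when that conversion is available) is:

iconv --list | grep -i c99

If C99 shows up in the output, the patched preprocessor can use it to convert the source input.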

Change the configure command to

../gcc-5.2.0/configure -v --disable-multilib \
    --with-libiconv-prefix=/usr/local/gcc-5.2 \
    --prefix=/usr/local/gcc-5.2 \
    --enable-languages="c,c++"

if you are building for x86 and want to include the C++ compiler as well.
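
Once the patched compiler is built, it should (as a sketch, assuming the install prefix above and a source file saved as UTF-8) accept raw UTF-8 identifiers without any escapes:

// hoyde.cpp - raw UTF-8 identifier, no \u escapes needed
int høyde = 10;

int main()
{
    return høyde == 10 ? 0 : 1;  // exits with 0 if the identifier worked
}

compiled with, for example, /usr/local/gcc-5.2/bin/g++ hoyde.cpp.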

– ejolson