11

Possible Duplicate:
C++ source in unicode

I just discovered this line of code in a project:

string überwachung;

I was surprised, because I thought umlauts like 'äöü' were not allowed in C++ code outside of string literals and the like, and that using them would result in a compiler error. But this compiles just fine with Visual Studio 2008.

  • Is this a special Microsoft feature, or are umlauts allowed with other compilers too?
  • Are there any potential problems with that (portability, system language settings, ...)?
  • I can clearly remember this was not allowed. When did it change?

Kind regards for any clarification

P.S.: The tool cppcheck even marks this usage as an error, even though it compiles.

nabulke
  • I think it depends on what encodings your compiler supports, this answer may be related http://stackoverflow.com/questions/5508110/why-is-this-program-erroneously-rejected-by-three-c-compilers/5508168#5508168 – josefx Apr 12 '11 at 14:23
  • @josefx: Erm, you do realize that's a joke question, right? I mean, some parts of the answer are certainly valid nonetheless, but linking to it is a bit strange... – Cody Gray - on strike Apr 12 '11 at 14:25
  • @Cody Gray right, I should have added a warning, but I tend to trust answers quoting a standard (which could backfire with joke questions). – josefx Apr 12 '11 at 14:45

5 Answers

6

GCC complains about it (tested on codepad):

: error: stray '\303' in program
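The '\303' in that message is an octal byte value. Assuming the source file was saved as UTF-8, 'ü' is stored as the two bytes 0xC3 0xBC, and 0xC3 is 303 in octal. A minimal sketch that reproduces the notation (the helper name `octal_escapes` is made up for this illustration, and the bytes are written as hex escapes so the snippet itself stays ASCII):

```cpp
#include <cstdio>
#include <string>

// Render every byte of `s` as a three-digit octal escape, the same notation
// GCC uses in its "stray '\303' in program" diagnostic.
std::string octal_escapes(const std::string& s) {
    std::string out;
    char buf[8];
    for (unsigned char byte : s) {
        std::snprintf(buf, sizeof buf, "\\%03o", static_cast<unsigned>(byte));
        out += buf;
    }
    return out;
}
```

Calling `octal_escapes("\xC3\xBC")` yields the text `\303\274`, i.e. the first byte of the umlaut is the one GCC reported as stray.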

The C++ language standard itself limits the basic source character set to 96 characters: the space, the control characters for horizontal tab, vertical tab, form feed and new-line, and 91 graphical characters, all of which are within ASCII. However, there's a nice footnote:

The glyphs for the members of the basic source character set are intended to identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII character set. However, because the mapping from source file characters to the source character set (described in translation phase 1) is specified as implementation-defined, an implementation is required to document how the basic source characters are represented in source files.

... and translation phase 1 is (emphasis mine):

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined.

Generally, you shouldn't use umlauts or other special characters in your code. It may work, but if it does, it's a compiler-specific feature.

Alexander Gessler
4

See section E/2 of the C++03 standard:

1 This clause lists the complete set of hexadecimal code values that are valid in universal-character-names in C++ identifiers (2.10).

Latin: 00c0–00d6, 00d8–00f6, 00f8–01f5, 01fa–0217, 0250–02a8, 1e00–1e9a, 1ea0–1ef9

This includes most accented letters.

The problem is that C++03 didn't specify UTF-8 as the input format. Even C++11 maintains compatibility with EBCDIC.

So, you can certainly create an identifier with an umlaut; the problem is getting a text editor that will interpret the universal-character-name and display it properly. Otherwise you're stuck inputting Unicode directly in hexadecimal format \uXXXX, e.g. \u00FC for ü.
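As a rough sketch of the hexadecimal spelling, the identifier from the question can be written with a universal-character-name so that the source file contains only ASCII. The function name `make_label` and the string value are invented for the example, and compiler support is an assumption: reasonably recent GCC and Clang releases accept this, while older GCC versions reportedly needed `-fextended-identifiers`.

```cpp
#include <string>

// The local variable is spelled with the universal-character-name \u00FC
// (U+00FC, u with diaeresis), so this file stays pure ASCII.
std::string make_label() {
    std::string \u00FCberwachung = "monitoring";
    return \u00FCberwachung;
}
```

Whether an editor or debugger then shows the name with the actual umlaut depends on the tool, which is the usability problem described above.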

A compiler which accepts UTF-8 in string constants but not in identifiers suffers from shortsighted implementation. Clang, at least, properly translates UTF-8 to universal-character-names in Phase 1.

Potatoswatter
2

I believe this is the clause that applies...

2.2 Character Sets

The basic source character set consists of 96 characters: the space character, the control characters representing horizontal tab, vertical tab, form feed, and new-line, plus the following 91 graphical characters:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " '

So the use of the umlaut would appear to be a compiler-specific extension.

John Dibling
1

This would be allowed by the standard if and only if your editor was translating from the character with an umlaut (or other diacritical) into one of the allowed characters. In particular, an identifier in C++ is defined as:

identifier:
    nondigit
    identifier nondigit
    identifier digit

nondigit: one of
    universal-character-name
    _ a b c d e f g h i j k l m
      n o p q r s t u v w x y z
      A B C D E F G H I J K L M
      N O P Q R S T U V W X Y Z

As far as I can see, that doesn't allow characters with diacriticals (except as a UCN). It looks to me like a compiler is required to issue at least one diagnostic for a program that contains any character other than those above (though it is still allowed to translate the program). Doing a quick check, I haven't been able to find a compiler flag that gets VC++ to issue a diagnostic for this code. At least IMO, it fails to conform in this respect.

On the other hand, this could just be viewed as VC++ implementing one of the new features of C++11. At least as of N3242, the new C++ draft adds a new item after the table above: "other implementation-defined characters". This gives the compiler permission to accept any other characters it wants to (though it is supposed to document what they are).
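To make the pre-C++11 grammar quoted above concrete, here is a small sketch of a checker for it. The function names are hypothetical, and it simplifies in three ways that are assumptions of this example: it does not verify that a universal-character-name falls into the ranges the standard allows, it does not reject keywords, and it assumes an ASCII execution character set.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Length of a universal-character-name (\uXXXX or \UXXXXXXXX) starting at
// position i, or 0 if there is none. Range checks on the code point are omitted.
std::size_t ucn_length(const std::string& s, std::size_t i) {
    if (i + 1 >= s.size() || s[i] != '\\') return 0;
    std::size_t digits = s[i + 1] == 'u' ? 4 : s[i + 1] == 'U' ? 8 : 0;
    if (digits == 0 || i + 2 + digits > s.size()) return 0;
    for (std::size_t k = 0; k < digits; ++k) {
        if (!std::isxdigit(static_cast<unsigned char>(s[i + 2 + k]))) return 0;
    }
    return 2 + digits;
}

// The non-UCN alternatives of "nondigit": underscore and the 52 ASCII letters.
bool is_ascii_nondigit(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// True if s matches the quoted grammar: one nondigit followed by any mix of
// nondigits and digits.
bool is_identifier(const std::string& s) {
    if (s.empty()) return false;
    std::size_t i = 0;
    bool first = true;
    while (i < s.size()) {
        if (std::size_t n = ucn_length(s, i)) {
            i += n;                       // universal-character-name
        } else if (is_ascii_nondigit(s[i])) {
            ++i;                          // letter or underscore
        } else if (!first && s[i] >= '0' && s[i] <= '9') {
            ++i;                          // digits are allowed after the first element
        } else {
            return false;                 // e.g. a raw UTF-8 byte of an umlaut
        }
        first = false;
    }
    return true;
}
```

Under these rules the text `\u00FCberwachung` is accepted, while the same name with the raw UTF-8 bytes of the umlaut is rejected, which matches the reading that only the UCN form is covered by that grammar.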

Jerry Coffin
  • `universal-character-name` is the C++03 standard's custom encoding of Unicode. With a suitable editor, §E.2 guarantees that you can insert a `ü` (U+00FC) into C++ source; UTF-8 support is just a usability feature. – Potatoswatter Apr 12 '11 at 15:08
1

The compiler is free to support any characters in identifiers it desires. Your compiler apparently supports umlauts. However, it is not guaranteed by the language standard. You can't use umlauts if you expect your program to be standard-compliant.

For another example, some compilers allow the $ character in identifiers, even though the language specification does not support it.
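A tiny sketch of that `$` case follows. The function and the 20% figure are made up for illustration. It is deliberately non-standard code: GCC and Clang accept it by default on common targets as far as I know, but a compiler running in a strict or pedantic mode may warn about it or reject it.

```cpp
// Non-standard: '$' in identifiers is a compiler extension, not part of the
// identifier grammar in the C++ standard.
int add_tax$(int net$) {
    return net$ + net$ / 5;  // hypothetical 20% surcharge, integer arithmetic
}
```

If this compiles with your toolchain, `add_tax$(100)` evaluates to 120; the fact that it compiles at all says something about the compiler, not about the language.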

AnT stands with Russia