identifier character set (clang)

Question

I never use clang, and I accidentally discovered that this piece of code:

#include <iostream>

void функция(int переменная)
{
    std::cout << переменная << std::endl;
}

int main()
{
    int русская_переменная = 0;
    функция(русская_переменная);
}

compiles fine: http://rextester.com/NFXBL38644 (clang 3.4 (clang++ -Wall -std=c++11 -O2)).

Is this a clang extension? And if so, why? Thanks.

UPD: I'm really asking why clang made this decision, because I have never found a discussion where someone wanted more characters than the C++ standard currently allows (2.3, rev. 3691).

@quantdev I think the OP obviously means that symbols in cyrillic letters are accepted as [these aren't here](http://ideone.com/jTrSlJ) — πάντα ῥεῖ, Jun 28 '14 at 18:43
They're Cyrillic letters, so they're covered by the definition of an identifier, or am i missing something? — Anya Shenanigans, Jun 28 '14 at 18:44
@Petesh Yes, universal characters are allowed in identifiers, we can find the grammar where this is allowed in my answer to [Can you start a class name with a numeric digit?](http://stackoverflow.com/a/15285827/1708801) ... there are probably other threads that cover this as well. — Shafik Yaghmour, Jun 29 '14 at 01:46
@grisha - you've accepted the half-answer. The complete answer is just below it. This phenomenon has to do with 2 things: **(1)** the input encoding of the compiler (i.e. a way to get these characters into the compiler), and **(2)** the characters which are allowed in identifiers. Cyrillic characters are allowed, so it's just a matter of using a multibyte-capable source file encoding in your compiler. — rustyx, Sep 03 '16 at 17:04

Carl Norum · Accepted Answer · 2014-06-28T18:48:24.230

It's not so much an extension as it is Clang's interpretation of the Multibyte characters part of the standard. Clang supports UTF-8 source code files.

As to why, I guess "why not?" is the only real answer; it seems useful and reasonable to me to support a larger character set.

Here are the relevant parts of the standard (C11 draft):

5.2.1 Character sets

1 Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set). Each set is further divided into a basic character set, whose contents are given by this subclause, and a set of zero or more locale-specific members (which are not members of the basic character set) called extended characters. The combined set is also called the extended character set. The values of the members of the execution character set are implementation-defined.

2 In a character constant or string literal, members of the execution character set shall be represented by corresponding members of the source character set or by escape sequences consisting of the backslash \ followed by one or more characters. A byte with all bits set to 0, called the null character, shall exist in the basic execution character set; it is used to terminate a character string.

3 Both the basic source and basic execution character sets shall have the following members: the 26 uppercase letters of the Latin alphabet
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
the 26 lowercase letters of the Latin alphabet
a b c d e f g h i j k l m
n o p q r s t u v w x y z
the 10 decimal digits
0 1 2 3 4 5 6 7 8 9
the following 29 graphic characters
! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~
the space character, and control characters representing horizontal tab, vertical tab, and form feed. The representation of each member of the source and execution basic character sets shall fit in a byte. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. In source files, there shall be some way of indicating the end of each line of text; this International Standard treats such an end-of-line indicator as if it were a single new-line character. In the basic execution character set, there shall be control characters representing alert, backspace, carriage return, and new line. If any other characters are encountered in a source file (except in an identifier, a character constant, a string literal, a header name, a comment, or a preprocessing token that is never converted to a token), the behavior is undefined.

4 A letter is an uppercase letter or a lowercase letter as defined above; in this International Standard the term does not include other characters that are letters in other alphabets.

5 The universal character name construct provides a way to name other characters.

And also:

5.2.1.2 Multibyte characters

1 The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set. For both character sets, the following shall hold:

— The basic character set shall be present and each character shall be encoded as a single byte.

— The presence, meaning, and representation of any additional members is locale-specific.

— A multibyte character set may have a state-dependent encoding, wherein each sequence of multibyte characters begins in an initial shift state and enters other locale-specific shift states when specific multibyte characters are encountered in the sequence. While in the initial shift state, all single-byte characters retain their usual interpretation and do not alter the shift state. The interpretation for subsequent bytes in the sequence is a function of the current shift state.

— A byte with all bits zero shall be interpreted as a null character independent of shift state. Such a byte shall not occur as part of any other multibyte character.

2 For source files, the following shall hold:

— An identifier, comment, string literal, character constant, or header name shall begin and end in the initial shift state.

— An identifier, comment, string literal, character constant, or header name shall consist of a sequence of valid multibyte characters.

I think that the question is more like "does the standard allow arbitrary Unicode characters for identifiers"? — Matteo Italia, Jun 28 '14 at 18:43
The standard does not; I was saying "yes it's an extension." I'll edit to be clearer. — Carl Norum, Jun 28 '14 at 18:44
There are good reasons not to allow unicode identifiers. It is even a [standard loophole](http://meta.codegolf.stackexchange.com/a/1657/25116) in code golf. — nwp, Jun 28 '14 at 19:01
@CarlNorum In fact the C++ standard _does_ permit most Unicode characters in source code, via UCNs. And since clang's source encoding is UTF-8 the standard requires clang to support the UTF-8 encoded versions of these characters in identifiers. — bames53, Jun 28 '14 at 19:19

bames53 · Answer 2 · 2016-09-03T19:45:13.657

Given clang's usage of UTF-8 as the source encoding, this behavior is mandated by the standard:

C++ defines an identifier as the following:

identifier:
      identifier-nondigit
      identifier identifier-nondigit
      identifier digit
identifier-nondigit:
      nondigit
      universal-character-name
      other implementation-defined characters

The important part here is that identifiers can include universal-character-names. The specification also lists the allowed UCNs:

Annex E (normative)

Universal character names for identifier characters [charname]

E.1 Ranges of characters allowed [charname.allowed]

00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

0100-167F, 1681-180D, 180F-1FFF

200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F

2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

3004-3007, 3021-302F, 3031-303F

3040-D7FF

F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD

10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD,
  60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD,
  B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

The Cyrillic characters in your identifier (Unicode block U+0400 to U+04FF) are in the range 0100-167F.
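As a rough illustration of the range check, here is a minimal sketch that tests a code point against the Annex E.1 list quoted above. The names `CodePointRange`, `kAllowedRanges` and `is_allowed_ucn` are made up for this example and are not part of any compiler or library; the table is simply transcribed from the quoted list.

```cpp
// Inclusive range of Unicode code points [first, last].
struct CodePointRange {
    char32_t first;
    char32_t last;
};

// Transcribed from the Annex E.1 list quoted above (planes 0 only;
// the supplementary planes follow a regular pattern, handled below).
constexpr CodePointRange kAllowedRanges[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// Returns true if the code point appears in the quoted E.1 ranges.
// Basic source characters (a-z, 0-9, _) are not covered here; the grammar
// handles those through nondigit and digit.
bool is_allowed_ucn(char32_t c) {
    for (const CodePointRange& r : kAllowedRanges) {
        if (c >= r.first && c <= r.last) return true;
    }
    // 10000-1FFFD, 20000-2FFFD, ..., E0000-EFFFD
    if (c >= 0x10000 && c <= 0xEFFFD) {
        return (c & 0xFFFF) <= 0xFFFD;
    }
    return false;
}
```

With this table, `is_allowed_ucn(0x043A)` (Cyrillic к) falls in 0100-167F, while `is_allowed_ucn(0x00D7)` (the multiplication sign) falls in the gap between 00C0-00D6 and 00D8-00F6.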

The C++ specification further mandates that characters encoded in the source encoding be handled identically to UCNs:

Any source file character not in the basic source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently [...].)
(n3337 §2.2 Phases of translation [lex.phases]/1)

So given clang's choice of UTF-8 as the source encoding, the spec mandates that these characters be converted to UCNs (or that clang's behavior be indistinguishable from performing such a conversion), and these UCNs are permitted by the spec to appear in identifiers.
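To make the conversion concrete, here is a minimal sketch of what "replace each extended character by its UCN" could look like for UTF-8 input. It is not how clang is implemented (the standard only requires equivalent behavior); `to_ucn_spelling` is a hypothetical helper, and it assumes well-formed UTF-8 without validating it.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Rewrites every non-ASCII character of a UTF-8 string as a
// universal-character-name (\uXXXX, or \UXXXXXXXX above U+FFFF).
// Assumption: the input is well-formed UTF-8; no validation is done.
std::string to_ucn_spelling(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {            // basic (ASCII) character: copy as-is
            out += utf8[i];
            ++i;
            continue;
        }
        int extra;                    // number of continuation bytes
        std::uint32_t cp;             // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else                            { extra = 3; cp = lead & 0x07; }
        ++i;
        for (int k = 0; k < extra && i < utf8.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i]) & 0x3F);
        }
        char buf[16];
        std::snprintf(buf, sizeof buf,
                      cp <= 0xFFFF ? "\\u%04x" : "\\U%08x",
                      static_cast<unsigned>(cp));
        out += buf;
    }
    return out;
}
```

Feeding it the UTF-8 bytes of `int кошка = 10;` yields a line whose identifier is spelled entirely with `\u` escapes, which is the form the identifier grammar above accepts.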

It goes even further. Emoji characters happen to be in the ranges allowed by the C++ spec, so if you've seen some of those examples of Swift code with emoji identifiers and were surprised by such capability you might be even more surprised to know that C++ has exactly the same capability:

http://rextester.com/EPYJ41676

https://i.stack.imgur.com/4619a.jpg

Another fact that may be surprising is that this behavior isn't new with C++11; C++ has mandated this behavior since C++98. It's just that compilers ignored this for a long time: Clang implemented this feature in version 3.3 in 2013. According to this documentation Microsoft Visual Studio supports this in 2015.

Even today GCC 6.1 only supports UCNs in identifiers when they are written literally, and does not obey the mandate that any character in its extended source character set must be treated identically with the corresponding universal-character-name. E.g. gcc allows int \u043a\u043e\u0448\u043a\u0430 = 10; but will not allow int кошка = 10; even with -finput-charset=utf-8.
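If you want to try the UCN spelling on your own compiler, a small sketch along these lines should do; `read_cat` is a made-up name, and the identifier inside it is кошка written with universal-character-names only, so the file itself stays pure ASCII.

```cpp
// Declares and reads a local variable whose name is spelled with UCNs.
// The five escapes are the code points of the Cyrillic letters
// к (043a), о (043e), ш (0448), к (043a), а (0430).
int read_cat() {
    int \u043a\u043e\u0448\u043a\u0430 = 10;
    return \u043a\u043e\u0448\u043a\u0430;
}
```

On a compiler that also accepts UTF-8 source directly, replacing either occurrence with the literal Cyrillic spelling should name the same variable, per the phase 1 rule quoted above.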
