
Clang (since version 3.3) supports Unicode characters in variable names: Clang 3.3 Release Notes, Major New Features.

However, some special characters are still forbidden.

int main(){
    double α = 2.; // Alpha, ok!
    double ∞ = 99999.; // Infinity, error
}

giving:

error: non-ASCII characters are not allowed outside of literals and identifiers
        double ∞ = 99999.;

What is the fundamental difference between α (alpha) and ∞ (infinity) for Clang? Is it that the former counts as Unicode while the latter, although not ASCII, somehow does not?

Is there a workaround or an option to allow this set of characters in Clang (or BTW in GCC)?

Notes: 1) ∞ is just an example; there are a lot of characters that are potentially useful, but also forbidden, like or . 2) I am not asking whether it is a good idea; please take it as a technical question. 3) I am interested in the C++ compiler of Clang 3.4 on Linux (GCC 4.8.3 (2014-05-22) doesn't support this). I am saving the source files with gedit using UTF-8 encoding and Unix/Linux line endings. 4) Adding a normal first character doesn't help: _∞


The answers point to a definite NO. Some ranges are indeed not allowed nor will they be soon. To move one step further to total craziness, the best alternative I found was to use characters that effectively look the same. (Now, this I might admit is not a good idea.) Those alternatives can be found here http://shapecatcher.com/. The result (sorry if it hurts your eyes):

//double ∞ = 99999.; // Still an error
//double ⧞ = 99999.; // Infinity negated. Still an error

double ꝏ = 99999.;   // Letter oo
double Ꝏ = 99999.;  // Letter OO

//double ⧜ = 99999.; // Incomplete infinity. Still an error

Other "alternative" dead ringers mentioned in the question that are in the allowed range: ʃ, .

Note: This question has Unicode text that may not display correctly in all browsers.

Peter Mortensen
alfC
  • these names are a terrible idea. What do you want to achieve with that? Some sort of obfuscation contest?? – stefan Oct 30 '14 at 18:15
  • @stefan: Presumably, writing code that looks like mathematical notation. That's not a bad idea if the intended audience is mathematicians. – Mike Seymour Oct 30 '14 at 18:19
  • @MikeSeymour I'm a mathematician, and I hate it ;-) But fair enough.. It's non-portable though. That's the biggest drawback of anything. – stefan Oct 30 '14 at 18:20
  • @stefan I don't think `double const π = 3.14159265359;` is obfuscation used in the right context. Non-portability is another issue and it is part of the answer, after all the standard seems to allow it in a limited way. – alfC Oct 30 '14 at 18:37
  • [GCC caught up](https://stackoverflow.com/questions/30130806/using-emoji-as-identifier-names-in-c-in-visual-studio-or-gcc/64108334#64108334) with [GCC 10](https://gcc.gnu.org/releases.html) (2020-05-07). – Peter Mortensen Aug 20 '23 at 11:22

1 Answer

So the Clang document says (emphasis mine):

This feature allows identifiers to contain certain Unicode characters, as specified by the active language standard;

This is covered in the draft C++ standard Annex E. The characters allowed are as follows:

E.1 Ranges of characters allowed [charname.allowed]

00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00B7-00BA, 00BC-00BE, 00C0-00D6, 00D8-00F6, 00F8-00FF

0100-167F, 1681-180D, 180F-1FFF

200B-200D, 202A-202E, 203F-2040, 2054, 2060-206F

2070-218F, 2460-24FF, 2776-2793, 2C00-2DFF, 2E80-2FFF

3004-3007, 3021-302F, 3031-303F

3040-D7FF

F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD

10000-1FFFD, 20000-2FFFD, 30000-3FFFD, 40000-4FFFD, 50000-5FFFD, 60000-6FFFD, 70000-7FFFD, 80000-8FFFD, 90000-9FFFD, A0000-AFFFD, B0000-BFFFD, C0000-CFFFD, D0000-DFFFD, E0000-EFFFD

The code point for infinity, U+221E, is not in the list: it falls in the gap between the ranges 2070-218F and 2460-24FF.
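To check other candidate characters against the list, a small lookup can help. The following is only a sketch: it encodes the E.1 ranges quoted above as a table and tests a code point against them. The names `Range`, `kAnnexE1` and `in_annex_e1` are made up for this illustration, and the function models only the quoted list, not any additional rules a particular compiler may apply (for example, restrictions on the first character of an identifier).

```cpp
#include <cstdint>

// Inclusive range of Unicode code points (illustrative helper type).
struct Range { std::uint32_t first, last; };

// BMP ranges transcribed from the E.1 list quoted in this answer.
constexpr Range kAnnexE1[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

// True if cp lies in one of the quoted E.1 ranges (needs C++14 or later
// because of the loop inside a constexpr function).
constexpr bool in_annex_e1(std::uint32_t cp) {
    for (const Range& r : kAnnexE1) {
        if (cp >= r.first && cp <= r.last) return true;
    }
    // Supplementary ranges 10000-1FFFD ... E0000-EFFFD: every plane from
    // 1 to E is allowed except its last two code points (xFFFE, xFFFF).
    return cp >= 0x10000 && cp <= 0xEFFFD && (cp & 0xFFFF) <= 0xFFFD;
}

// The two characters from the question, checked at compile time.
static_assert(in_annex_e1(0x03B1), "U+03B1 (alpha) is in 0100-167F");
static_assert(!in_annex_e1(0x221E), "U+221E (infinity) is in no range");
```

Under the same table, U+A74F (ꝏ) and U+A74E (Ꝏ) land in 3040-D7FF, while U+29DE (⧞) and U+29DC (⧜) fall between 2776-2793 and 2C00-2DFF, which matches the results reported in the question.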

For reference: these are the codes above converted to Unicode characters (some of them may not display correctly in all browsers/available fonts).

¨, ª, ­,

¯, ²-µ, ·-º, ¼-¾, À-Ö, Ø-ö, ø-ÿ

Ā-ᙿ, ᚁ-᠍, ᠏-῿ ​-‍, ‪-‮, ‿-⁀, ⁔,

⁠- ⁰-↏, ①-⓿, ❶-➓, Ⰰ-ⷿ, ⺀-⿿

〄-〇, 〡-〯, 〱-〿

぀-퟿ 豈-ﴽ, ﵀-﷏,

ﷰ-﹄, ﹇-�

-, -, -, -, -, -, -, -, -, -, -, -, -, -

I could not find an extensive document that covers the rationale for the ranges chosen, although N3146: Recommendations for extended identifier characters for C and C++ does provide some details on the influences.

Peter Mortensen
Shafik Yaghmour
  • I used this tool http://rishida.net/tools/conversion/ to convert your codes to their representation in your answer if you don't mind. Thank you. – alfC Oct 30 '14 at 18:47
  • Do you know the criterion used to choose these ranges of characters? – alfC Oct 30 '14 at 22:24
  • @alfC I feel like I have seen a rationale before but I can not find it anymore. I was able to find a document that goes into the many influences and mentions some rationale but is detail-lite. I added it to my answer. – Shafik Yaghmour Oct 31 '14 at 12:17