
In C++, we can use a wide variety of Unicode characters in identifiers. For example, you could name a variable résumé.

Those accented es can be represented in different ways: either as the precomposed character (U+00E9) or as a plain e followed by a combining acute accent (U+0301). Many applications normalize such strings so that seemingly identical strings actually match.
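For example (using string literals here only to show the underlying difference; compiled as C++17, where u8 literals are still arrays of char):

```cpp
#include <cstring>
#include <iostream>

int main() {
    // Precomposed: U+00E9 LATIN SMALL LETTER E WITH ACUTE
    const char precomposed[] = u8"r\u00e9sum\u00e9";
    // Decomposed: U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT
    const char decomposed[] = u8"re\u0301sume\u0301";

    std::cout << std::strlen(precomposed) << '\n';             // 8 bytes in UTF-8
    std::cout << std::strlen(decomposed) << '\n';              // 10 bytes in UTF-8
    std::cout << std::strcmp(precomposed, decomposed) << '\n'; // nonzero: not equal
}
```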

Looking at the C++ standard, I don't see anything that requires the compiler to normalize identifiers, so variable résumé could be distinct from variable résumé. (In my tests, neither MSVC nor Clang appears to normalize identifiers.)
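For instance, I'd expect something like this to declare two distinct variables on a compiler that accepts these characters in identifiers and doesn't normalize them (spelled with universal-character-names so the code point sequences are explicit):

```cpp
// Both of these render as "résumé" in most editors, but they are spelled
// with different code point sequences.
int r\u00e9sum\u00e9 = 1;     // precomposed: U+00E9
int re\u0301sume\u0301 = 2;   // decomposed: U+0065 followed by U+0301

int main() {
    // On a non-normalizing compiler these are two separate variables (sum is 3);
    // a normalizing compiler would presumably treat the second declaration as a
    // redefinition of the first.
    return r\u00e9sum\u00e9 + re\u0301sume\u0301;
}
```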

Is there anything that prohibits the compiler from choosing a normal form? If not, at what phase of translation should normalization occur?

[To be clear: I'm talking about identifiers, not string literals.]

Adrian McCarthy
  • Possible duplicate of [Unicode Identifiers and Source Code in C++11?](https://stackoverflow.com/questions/5676978/unicode-identifiers-and-source-code-in-c11) – πάντα ῥεῖ Feb 15 '19 at 19:03
  • As a rule of thumb, you shouldn't use such symbols, to keep your code as portable as possible. – πάντα ῥεῖ Feb 15 '19 at 19:05
  • Possible duplicate of [Unicode/special characters in variable names in clang not allowed?](https://stackoverflow.com/questions/26660180/unicode-special-characters-in-variable-names-in-clang-not-allowed) – adrtam Feb 15 '19 at 19:05
  • the standard specifically defines character set as a subset of `ISO/IEC 10646`, which is ASCII, not unicode. – Serge Feb 15 '19 at 19:09
  • ISO/IEC 646 is related to ASCII; ISO/IEC 10646 is related to Unicode. Although many non-ASCII characters can be used, be careful. I'm not aware of any normalization (NFC, NFD) that will occur, so the identifiers will probably be as encoded in the source file. And if the team has some source files in NFC and some in NFD (and if my theory is correct), expect to be hunting down odd linker errors about missing symbols; a sketch of that scenario follows these comments. Also, your toolchain may have constraints on identifiers for your platform (e.g., if you are on an IBM 3390). – Eljay Feb 15 '19 at 19:33
  • No, this is not a duplicate of either of those questions. The questions are: Is a C++ compiler _allowed_ to normalize identifiers? And, if a C++ compiler chose to normalize, at what phase of translation should it do so? – Adrian McCarthy Feb 15 '19 at 19:43
  • @Serge: You're missing table 2. It certainly includes many extended characters that are not ASCII. – Adrian McCarthy Feb 15 '19 at 20:02
  • Unicode is such a mess ... is there any case where different representations are handled as the same thing? Is that even doable? – curiousguy Feb 16 '19 at 18:29
  • @curiousguy: Yes, normalization is all about being able to handle different representations as the same. You can see this in your browser by choosing to find `résumé` on this page, and you'll see that it finds all three instances in the question even though I used different representations when I wrote the question. – Adrian McCarthy Feb 17 '19 at 21:04
  • @AdrianMcCarthy Does any complex real-world program that cares about matching strings really care about doing normalization correctly? – curiousguy Feb 17 '19 at 22:21
  • @curiousguy: You mean besides browsers, as I pointed out in my last comment? – Adrian McCarthy Feb 18 '19 at 00:26
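To illustrate Eljay's linker-error scenario, here is a hypothetical sketch (identifiers spelled with universal-character-names to stand in for the two file encodings; it assumes the compiler accepts them and performs no normalization):

```cpp
// file1.cpp -- as if saved in NFC: the identifier uses precomposed U+00E9
int r\u00e9sum\u00e9 = 42;

// file2.cpp -- as if saved in NFD: the "same" name uses e followed by U+0301
extern int re\u0301sume\u0301;

int get() { return re\u0301sume\u0301; }

// Linking these objects fails with an undefined-symbol error, because the
// mangled names are built from two different code point sequences.
```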

1 Answer


I believe the compiler is permitted to perform this normalization in translation phase 1:

> Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. The set of physical source file characters accepted is implementation-defined. Any source file character not in the basic source character set (5.3) is replaced by the universal-character-name that designates that character. An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (e.g., using the \uXXXX notation), are handled equivalently except where this replacement is reverted (5.4) in a raw string literal.

Since the mapping of source file characters to the basic source character set and to universal character names is implementation-defined, the implementation may choose to convert whatever byte sequences represent either the precomposed or decomposed lowercase-e-with-acute-accent to the same universal character name, but must document this choice.
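As a purely illustrative sketch (nothing in the standard mandates this particular approach, and the function here is hypothetical), a normalizing implementation could run a composition pass over the decoded source file before identifiers are formed, so that e followed by U+0301 becomes U+00E9:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: compose U+0065 followed by U+0301 into U+00E9.
// A real normalizing implementation would apply full Unicode canonical
// composition (NFC), not just this single pair.
std::u32string compose_e_acute(const std::u32string& codepoints) {
    std::u32string result;
    for (std::size_t i = 0; i < codepoints.size(); ++i) {
        if (codepoints[i] == U'e' && i + 1 < codepoints.size()
            && codepoints[i + 1] == U'\u0301') {
            result.push_back(U'\u00e9'); // both spellings end up as one code point
            ++i;                         // consume the combining accent
        } else {
            result.push_back(codepoints[i]);
        }
    }
    return result;
}
```

After a pass like that, both spellings of résumé reach the later phases as the same code point sequence and therefore name the same identifier; since the mapping is implementation-defined, the only obligation is to document it.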

Brian Bi
  • Yes, I'd read about the implementation-defined mapping, but it wasn't clear to me if that had to be a one-to-one mapping. I guess if CR+LF is supposed to become a newline, then it doesn't have to be a one-to-one mapping. So I think your interpretation is correct. – Adrian McCarthy Feb 15 '19 at 20:48