
Trying to use escape sequences to construct a char8_t string (so as not to rely on the file/compiler encoding), I ran into an issue with MSVC.

I wonder if it is a bug, or if it is implementation dependent.
Is there a workaround?

#include <algorithm>
#include <iterator>

constexpr char8_t s1[] =     u8"\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
constexpr unsigned char s2[] = "\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";
//constexpr char8_t s3[] = u8"コ ン ニ チ ハ"; // same text entered directly, but relies on the source file encoding

static_assert(std::equal(std::begin(s1), std::end(s1),
                         std::begin(s2), std::end(s2))); // Fails on MSVC, passes on gcc/clang

Demo

Note: The final goal is to replace std::filesystem::u8path(s2) (std::filesystem::u8path is deprecated since C++20) with std::filesystem::path(s1).
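
A minimal sketch of that migration (illustrative only; the variable names and the single katakana used here are placeholders, not the original code):

#include <filesystem>

int main() {
    // C++17 style: a narrow string assumed to hold UTF-8, converted via u8path.
    const char utf8_bytes[] = "\xe3\x82\xb3";              // "コ" as UTF-8 bytes
    auto p_old = std::filesystem::u8path(utf8_bytes);      // deprecated since C++20

    // C++20 style: a char8_t string is always interpreted as UTF-8 by path.
    constexpr char8_t utf8_units[] = u8"\u30B3";           // "コ" by codepoint
    std::filesystem::path p_new(utf8_units);

    return p_old == p_new ? 0 : 1;                         // expected to compare equal
}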

Jarod42
  • Are you trying to enter UTF-8 code units, or do you want to specify Unicode codepoints? Because the latter is pretty easy, and I see no reason to want the former. – Nicol Bolas Mar 15 '22 at 14:27
  • @NicolBolas: It is probably my issue :-), as I try to enter UTF-8 code units (as I did in C++17). So entering `\uxxxx` should fix the issue. – Jarod42 Mar 15 '22 at 14:31
  • UTF8 is multibyte, char is single byte. Those strings aren't the same even as characters, much less as bytes. The *correct* option is `s3` and using the correct file encoding. Everything else is guaranteed to cause errors, simply because the wrong text is entered. Even English text requires non-US-ASCII characters, e.g. `Charlotte Brontë`. – Panagiotis Kanavos Mar 15 '22 at 14:32
  • @Jarod42 every web developer outside the US learned to save as UTF8 since the late 1990s. The only issue here is how the file is saved, not how the compiler would treat it – Panagiotis Kanavos Mar 15 '22 at 14:33
  • @NicolBolas: using `\uxxxx` still seems broken under MSVC [Demo](https://godbolt.org/z/8sKbcP7cK)... – Jarod42 Mar 15 '22 at 14:59
  • @PanagiotisKanavos: From [lex#phases-1.1](https://eel.is/c++draft/lex#phases-1.1), *"Physical source file characters are mapped, in an implementation-defined manner, to the translation character set"*, so it might depend on the compiler (gcc has `-finput-charset`). – Jarod42 Mar 15 '22 at 15:14
  • @Jarod42: Well, what are the differences? – Nicol Bolas Mar 15 '22 at 15:15
  • @NicolBolas: Shouldn't I use `u8"\u30B3 \u30F3 \u30CB \u30C1 \u30CF"`? – Jarod42 Mar 15 '22 at 15:19
  • @Jarod42: I mean what bytes are actually stored in the two strings. I don't know what codepoints those characters map to, so I can't say if it's correct. But if you want to know the parameters of the bug, you should start with where the two strings differ. – Nicol Bolas Mar 15 '22 at 15:23
  • No, you shouldn't use this. Nobody will be able to read it. Instead of having to ensure you use the correct LC_ALL setting just once on a machine, you'll have to wonder what your actual source code is every time, on every project. – Panagiotis Kanavos Mar 15 '22 at 15:23
  • @Jarod42: That is, is the problem that MSVC is compiling the textual version wrong or the codepoint version wrong? – Nicol Bolas Mar 15 '22 at 15:24
  • @NicolBolas: My sample shows the displayed characters too (`"コ ン ニ チ ハ"`), and gcc/clang accept the code... – Jarod42 Mar 15 '22 at 15:30
  • @Jarod42: That doesn't answer my question: what do the *actual bytes* say? – Nicol Bolas Mar 15 '22 at 15:31
  • Is `u8'\xe3'` valid or implementation dependent? (It is rejected by MSVC, when testing `s1[0] == u8'\xe3'`.) – Jarod42 Mar 15 '22 at 15:38
  • @PanagiotisKanavos: I have already had issues with files identified as extended ASCII instead of UTF-8 (with some accented letters in French). So using exclusively ASCII (except in a comment to make the escape sequence readable) avoids those issues. – Jarod42 Mar 15 '22 at 15:45
  • If that was true there would be no use for UTF8. Windows uses Unicode natively, so all you need to do is to ensure you save files as UTF8. MSVC specifically won't have any problem. On Linux, Mac you'll have to set the environment correctly (ie ensure LC_ALL uses UTF8). Again, all web developers use UTF8 to *avoid* the issues you face. Nobody uses escape sequences. This isn't just some developers, or even a minority. – Panagiotis Kanavos Mar 15 '22 at 16:08
  • You claim you have problems with a few characters in French. I have no problem with Greek characters like Αυτό Εδώ. StackOverflow has no problem, without encoding my comment, because it treats text as Unicode (because it's an ASP.NET Core application, saving text in Unicode fields). All I had to do since 2000 was ensure I saved files as UTF8. That includes both C++ and ASP files. `Extended ASCII` isn't an actual codepage, it's what people actually mean by ASCII. Your problems are probably because your machine uses Latin1/Windows-1252 as the default encoding. Change your editor to save as UTF8. – Panagiotis Kanavos Mar 15 '22 at 16:15
  • There are a lot of similar questions from R or Python 2 users that moved to Python 3. Especially data scientists that started to work with Russian or Chinese data in the last decade. In all cases the solution is the same as any other kind of document: ensure source files are saved as UTF8 instead of the machine's/user's default codepage. – Panagiotis Kanavos Mar 15 '22 at 16:19
  • @PanagiotisKanavos: "*All I had to do since 2000 was ensure I saved files as UTF8.*" When you say that, do you mean with a UTF-8 BOM? Because that can be a problem for other compilers. It's better to just use the compiler switch that says that the file is UTF-8. – Nicol Bolas Mar 15 '22 at 17:37

1 Answer


This is a bug in MSVC that I expect to be fixed at some point during Microsoft's implementation of C++23.

Historically, numeric escape sequences in character and string literals were not well specified in the C++ standard, and this led to a number of core issues. These issues were addressed by P2029, a paper adopted for C++23 in November 2020. The reported MSVC bug (along with an additional one related to non-encodeable characters) is discussed in the "Implementation impact" section of the paper.

As mentioned by other commenters, use of universal-character-names (UCNs) like \u1234 would be the preferred solution to avoid a dependency on source file encoding.
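
For example (a sketch based on the codepoints quoted in the comments, U+30B3 through U+30CF, not code taken from the paper), the question's assertion can be written with UCNs so the source file stays ASCII-only:

#include <algorithm>
#include <iterator>

// Same katakana as in the question, spelled with universal-character-names.
constexpr char8_t s1[] = u8"\u30B3 \u30F3 \u30CB \u30C1 \u30CF";
// The UTF-8 code units those codepoints encode to, as an ordinary narrow literal.
constexpr unsigned char s2[] = "\xe3\x82\xb3 \xe3\x83\xb3 \xe3\x83\x8b \xe3\x83\x81 \xe3\x83\x8f";

static_assert(std::equal(std::begin(s1), std::end(s1),
                         std::begin(s2), std::end(s2))); // holds on a conforming compiler

One of the comments above reports that the UCN spelling also misbehaved on the MSVC version available at the time, so the same bug may affect it until the P2029 fixes land.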

Tom Honermann