Being in a non-English-speaking country, I wanted to run a test with a char array and a non-ASCII character.

I compiled this code with MSVC and MinGW GCC:

#include <iostream>

int main()
{
    constexpr char const* c = "é";
    int i = 0;

    char const* s;

    for (s = c; *s; s++)
    {
        i++;
    }

    std::cout << "Size: " << i << std::endl;

    std::cout << "Char size: " << sizeof(char) << std::endl;
}

Both display Char size: 1, but MSVC displays Size: 1 and MinGW GCC displays Size: 2.

Is this undefined behaviour caused by the non-ASCII character, or is there another reason behind it (GCC encoding in UTF-8 and MSVC in UTF-16, maybe)?

f222
  • Use [`std::u8string`](https://en.cppreference.com/w/cpp/string/basic_string) if you want to guarantee UTF-8 encoding. Otherwise MSVC uses the text encoding of the file; presumably GCC is converting to UTF-8? – Alan Birtles Oct 13 '22 at 10:59
  • Try prefixing the literal with `u8` or `u` – Igor R. Oct 13 '22 at 11:03
  • What is the character encoding of the source file that you are feeding to the compiler? According to [this link](https://gcc.gnu.org/onlinedocs/gcc-4.1.2/cpp/Character-sets.html), it must be UTF-8, if you are using gcc (which MinGW is based on). If you are unable to answer this question (for example because the text editor you are using does not provide this information), then please provide a [hex dump](https://en.wikipedia.org/wiki/Hex_dump) of the source file. – Andreas Wenzel Oct 13 '22 at 11:04
  • If you don't know how to create a hex dump, [this question](https://stackoverflow.com/questions/1724586/can-i-hex-edit-a-file-in-visual-studio) may be useful. It explains how to open a file in binary mode, so that you can use Visual Studio as a [hex editor](https://en.wikipedia.org/wiki/Hex_editor). – Andreas Wenzel Oct 13 '22 at 11:11
  • Nothing is as hard to get right as "plain text". Fun video here: https://www.youtube.com/watch?v=_mZBa3sqTrI – Pepijn Kramer Oct 13 '22 at 11:13
  • C++ sources are ASCII; you should not use non-ASCII in source files. You can encode UTF-8 with "\x123"-style escapes. – Adrian Maire Oct 13 '22 at 11:47
  • @AdrianMaire that's a very bad idea. If C++ source files were ASCII-only, then no one would know what `\u4E0D\u8981\u7FFB\u8B6F\u9019\u500B` means, because even comments couldn't contain Unicode. [Modern C++ even allows Unicode characters in identifiers](https://stackoverflow.com/q/5676978/995714) – phuclv Oct 13 '22 at 13:13
  • @phuclv: Reading that many "implementation defined", I conclude that I was right. If having `"Ao\x123t"` is not acceptable, then move that resource out of the source file. – Adrian Maire Oct 13 '22 at 13:29
  • @AdrianMaire that's still silly in most cases. Instead of using the resource in the binary file section directly, which is very fast, you now need to do manual loading and other things. Some i18n engines also use literal strings for translation; for example, GNU on Linux uses `_`: `fprintf(stdout, _("Translate this\n"));`, which is kind of terrible but convenient. Probably you've never done i18n or faced Asian texts, or even non-Western-European texts. Lots of programmers are not English-proficient and they also use comments extensively in their native language, especially in Japan – phuclv Oct 13 '22 at 13:45
  • @AdrianMaire there's a reason C++ introduced `u8""`, `u""` and `U""` strings – phuclv Oct 13 '22 at 14:34

1 Answer

The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined.

GCC defaults to UTF-8, in which the character é uses two code units, and my guess is that MSVC uses code page 1252, in which the same character uses only one code unit. (That encoding uses a single code unit per character anyway.)
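
To see what the compiler actually emitted, a minimal check (not part of the original question; names are just for illustration) is to print each code unit of the literal as a hex value. With a UTF-8 execution charset you would expect the two bytes c3 a9, while with code page 1252 you would expect the single byte e9:

#include <iostream>

int main()
{
    constexpr char const* c = "é";

    // Print every code unit of the ordinary literal as an unsigned hex value.
    for (char const* s = c; *s; s++)
    {
        std::cout << std::hex
                  << static_cast<unsigned>(static_cast<unsigned char>(*s)) << ' ';
    }
    std::cout << '\n';
}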

Compilers typically have switches to change the ordinary-literal and execution character set encoding, e.g. the -fexec-charset option for GCC.
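
For example (a sketch; the accepted charset names depend on the iconv implementation GCC was built with, and the file name main.cpp is assumed), compiling the snippet above with different execution charsets should reproduce both results:

g++ -fexec-charset=CP1252 main.cpp   # é stored as one code unit (0xE9), like MSVC
g++ -fexec-charset=UTF-8 main.cpp    # é stored as two code units (0xC3 0xA9)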

Also be careful that the source file is saved in the encoding that the compiler expects. If the file is UTF-8-encoded but the compiler expects something else, then it will interpret the bytes corresponding to the intended character é as a different character (or sequence of characters). That is, however, independent of the ordinary literal encoding mentioned above. GCC, for example, has the -finput-charset option to choose the source encoding explicitly; it defaults to UTF-8.
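
As a sketch, if an editor had saved the file in Windows-1252 but you still want UTF-8 in the compiled binary, both encodings can be stated explicitly (again, the exact charset names depend on the iconv installation):

g++ -finput-charset=CP1252 -fexec-charset=UTF-8 main.cpp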

If you intend the literal to be encoded as UTF-8, then you should use u8-prefixed literals, which are guaranteed to use this encoding:

constexpr auto c = u8"é";

Note that the type auto here will be const char* in C++17, but const char8_t* since C++20; s must be adjusted accordingly. This then guarantees an output of 2 for the length (number of code units). Similarly, there are u and U prefixes for UTF-16 and UTF-32, in both of which only one code unit is used for é, but the code units are 2 or 4 bytes wide (assuming CHAR_BIT == 8), respectively (types char16_t and char32_t).
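
A minimal sketch of the adjusted loop under C++20 (using char8_t) could look like this; it should print 2 regardless of the compiler's ordinary literal encoding:

#include <iostream>

int main()
{
    constexpr char8_t const* c = u8"é"; // guaranteed UTF-8: two code units
    int i = 0;

    for (char8_t const* s = c; *s; s++)
    {
        i++;
    }

    std::cout << "Size: " << i << std::endl;                         // 2
    std::cout << "Code unit size: " << sizeof(char8_t) << std::endl; // 1
}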

user17732522