The encoding used to map ordinary string literals to a sequence of code units is (mostly) implementation-defined. GCC defaults to UTF-8, in which the character `é` uses two code units; my guess is that MSVC uses code page 1252, in which the same character uses only one code unit. (That encoding uses a single code unit per character anyway.) Compilers typically have switches to change the ordinary literal and execution character set encoding, e.g. the `-fexec-charset` option for GCC.
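
As a quick way to see which ordinary literal encoding your build actually used, here is a minimal sketch (the printed number depends on your compiler and flags, not on anything guaranteed by the standard):

```c++
#include <cstring>
#include <iostream>

int main() {
    // Number of code units in the ordinary literal "é", not counting the
    // terminating null: 2 with a UTF-8 execution charset (GCC's default),
    // 1 with a single-byte encoding such as code page 1252.
    std::cout << std::strlen("é") << '\n';
}
```

Recompiling the same file with a different `-fexec-charset` value (assuming your GCC/iconv installation recognizes the name you pass) changes the output accordingly.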
Also be careful that the source file is saved in an encoding that the compiler expects. If the file is UTF-8 encoded but the compiler expects something else, then it is going to interpret the bytes in the file corresponding to the intended character `é` as a different (sequence of) characters. That is however independent of the ordinary literal encoding mentioned above. GCC for example has the `-finput-charset` option to explicitly choose the source encoding and defaults to UTF-8.
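
If you want to see what the compiler actually stored after both steps (decoding the source, then encoding into the execution character set), a small diagnostic sketch is to dump the raw bytes of the literal:

```c++
#include <cstdio>
#include <cstring>

int main() {
    const char* s = "é";
    // Print the raw code units stored for the literal. With a UTF-8 source
    // read as UTF-8 and a UTF-8 execution charset this prints "c3 a9";
    // a mismatched source or execution charset yields different bytes.
    for (std::size_t i = 0; i < std::strlen(s); ++i)
        std::printf("%02x ", static_cast<unsigned char>(s[i]));
    std::printf("\n");
}
```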
If you intend the literal to be UTF-8 encoded into bytes, then you should use `u8`-prefixed literals, which are guaranteed to use this encoding:
```c++
constexpr auto c = u8"é";
```
Note that the type `auto` here will be `const char*` in C++17, but `const char8_t*` since C++20. `s` must be adjusted accordingly. This will then guarantee an output of `2` for the length (number of code units).
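
For example, a compile-time check of that guarantee, written with `sizeof` on the literal so it works in both C++17 and C++20 (and assuming the source file itself was read correctly, as discussed above):

```c++
// Array size minus the terminating null character = number of code units.
static_assert(sizeof(u8"é") / sizeof(u8"é"[0]) - 1 == 2,
              "é occupies two UTF-8 code units");
```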
Similarly there are `u` and `U` prefixes for UTF-16 and UTF-32, in both of which only one code unit would be used for `é`, but the size of a code unit would be 2 or 4 bytes (assuming `CHAR_BIT == 8`) respectively (types `char16_t` and `char32_t`).
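
The same kind of check for the other two prefixes (again assuming `CHAR_BIT == 8` and the usual code unit sizes on common platforms):

```c++
// é is U+00E9, so it fits in a single UTF-16 and a single UTF-32 code unit.
static_assert(sizeof(u"é") / sizeof(char16_t) - 1 == 1, "one UTF-16 code unit");
static_assert(sizeof(U"é") / sizeof(char32_t) - 1 == 1, "one UTF-32 code unit");
```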