I'm trying to use the C++11 `u8`, `u`, and `U` string literals to encode this emoji:
http://www.fileformat.info/info/unicode/char/1f601/index.htm

Now, I'm writing out the hex value of each code unit by hand for each encoding:

const char* utf8string = u8"\xF0\x9F\x98\x81";
const char16_t* utf16string = u"\xD83D\xDE01";
const char32_t* utf32string = U"\x0001F601";
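
For reference, this is how those code units fall out of U+1F601 (a worked check done by hand, not compiler output):

// UTF-8:  U+1F601 = 0b0'0001'1111'0110'0000'0001; split the 21 bits 3+6+6+6 into
//         11110'000 10'011111 10'011000 10'000001 = F0 9F 98 81
// UTF-16: 0x1F601 - 0x10000 = 0xF601; high 10 bits 0x03D + 0xD800 = 0xD83D,
//         low 10 bits 0x201 + 0xDC00 = 0xDE01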

This works fine with GCC 6.2 and Clang 3.8: the strings have lengths 4, 2, and 1 respectively. But with the Visual Studio 2015 compiler the lengths come out as 8, 2, and 1.
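
The raw bytes can be inspected with a quick dump (a minimal sketch; the VS2015 bytes are taken from the debug memory view quoted in the comments below):

#include <cstdio>

int main() {
    const char* utf8string = u8"\xF0\x9F\x98\x81";
    // Print each byte as hex up to the terminating '\0'.
    for (const char* p = utf8string; *p != '\0'; ++p)
        std::printf("%02x ", static_cast<unsigned char>(*p));
    std::printf("\n");
    // GCC/Clang: f0 9f 98 81
    // VS2015:    c3 b0 c2 9f c2 98 c2 81
}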

I'm using this code to get the length of each string:

#include <cstddef>
#include <iostream>

int main() {
    const char* smiley8 = u8"\xF0\x9F\x98\x81";
    const char16_t* smiley16 = u"\xD83D\xDE01";
    const char32_t* smiley32 = U"\x0001F601";

    // Advance each iterator until it reaches the terminating null code unit.
    auto smiley8_it = smiley8;
    while ((*++smiley8_it) != 0);

    auto smiley16_it = smiley16;
    while ((*++smiley16_it) != 0);

    auto smiley32_it = smiley32;
    while ((*++smiley32_it) != 0);

    // The length in code units is the distance from the start to the null.
    std::size_t smiley8_size = smiley8_it - smiley8;
    std::size_t smiley16_size = smiley16_it - smiley16;
    std::size_t smiley32_size = smiley32_it - smiley32;

    std::cout << smiley8_size << std::endl;
    std::cout << smiley16_size << std::endl;
    std::cout << smiley32_size << std::endl;
}

I also tested the UTF-8 string with std::strlen, which by definition agrees with the pointer walk above.
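
Roughly, that check looks like this (a minimal sketch, assuming <cstring> for std::strlen):

#include <cstring>
#include <iostream>

int main() {
    const char* smiley8 = u8"\xF0\x9F\x98\x81";
    // std::strlen counts char units up to (not including) the '\0'.
    std::cout << std::strlen(smiley8) << std::endl; // 4 on GCC/Clang, 8 on VS2015
}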

Any clues why this happens?

Edoren
  • TDM-GCC 5.1 on Windows returns 4 2 1. Doesn't help, just adding info. – Richard Critten Feb 23 '17 at 15:57
  • Why not use universal character names instead? This seems like a doctor, doctor kind of situation. (See the sketch after these comments.) – Kerrek SB Feb 23 '17 at 15:58
  • [This answer](http://stackoverflow.com/a/23950172/1382251) might help. – barak manos Feb 23 '17 at 15:58
  • It seems like VC takes \xF0, interprets it as U+00F0, then proceeds to encode it in UTF-8, since it's a `u8""` literal. Don't know, offhand, which compiler is right, just that this is obviously what the compiler is doing. – Sam Varshavchik Feb 23 '17 at 16:00
  • If you inspect the 1st string in the debug memory view window what do you see? – Richard Critten Feb 23 '17 at 16:00
  • @RichardCritten it contains c3 b0 c2 9f c2 98 c2 81 – Edoren Feb 23 '17 at 16:16
  • It seems MSVC is wrong. "**[lex.string]/16** The size of a narrow string literal is the total number of escape sequences and other characters, plus at least one for the multibyte encoding of each universal-character-name, plus one for the terminating `'\0'`." So the size of `u8"\xF0\x9F\x98\x81"` should be 5, not 10. – Igor Tandetnik Feb 23 '17 at 16:38
  • See also [DR1656](http://www.open-std.org/jtc1/sc22/wg21/docs/cwg_active.html#1656), which pretty clearly suggests that MSVC is wrong: escape sequences are meant to be taken as elements in the object representation of the string literal, not as Unicode characters to be UTF-8 encoded. – Igor Tandetnik Feb 23 '17 at 16:54
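
Following Kerrek SB's suggestion, here is a sketch of the universal-character-name version (with UCNs the compiler performs the encoding itself, so the element counts should be the same on every conforming compiler):

const char* utf8string = u8"\U0001F601";     // compiler emits F0 9F 98 81
const char16_t* utf16string = u"\U0001F601"; // compiler emits the surrogate pair D83D DE01
const char32_t* utf32string = U"\U0001F601"; // a single char32_t, 0x0001F601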

0 Answers