3

The main reason is that I am sending Unicode data (bytes, not characters) over sockets, and I wanted to make sure the endianness matches up, because wchar_t is UTF16.

Also the receiving program is my other one, so I will know that it is UTF16 and be able to react accordingly.

Here is my current algorithm. It kind of works, but produces a weird result. (The conversion and the conversion back are in the same application because I wanted to learn how to do it correctly before sending anything off.)

case WM_CREATE: {
    //Convert String to NetworkByte
    wchar_t Data[] = L"This is a string";
    char* DataA = (char*)Data;
    unsigned short uData = htons((unsigned int)DataA);

    //Convert String to HostByte
    unsigned short hData = ntohs(uData);
    DataA = (char*)&hData;
    wchar_t* DataW = (wchar_t*)DataA;
    MessageBeep(0);

    break;
}

Result:

쳌쳌쳌쳌쳌곭쳌쳌쳌쳌쳌ē쳌쳌쳌쳌This is a string
Trevin Corkery
  • wchar_t is not UTF16; it is a wide character type, [read this](http://stackoverflow.com/questions/16050218/utf8-vs-wide-char). It can be anything. – Stargateur Oct 29 '16 at 07:37
  • My bad, I must have misread a thread talking about wchar_t/unicode/char. But yes, you are right. (I heard its default is UTF16 on the MSVC compiler though) – Trevin Corkery Oct 29 '16 at 07:41
  • Type punning is UB in C++. I don't think you are allowed to do what you are doing with DataA. – asu Oct 29 '16 at 07:42
  • @Asu I was told that the way to send Unicode over sockets was to cast it to bytes, send it over the network, then recreate the string by casting it back. If this is a bad way of doing it, is there a better way? Thanks – Trevin Corkery Oct 29 '16 at 07:44
  • Use `MultiByteToWideChar` and `WideCharToMultiByte` for conversion between UTF16 (Windows standard) and UTF8 (network friendly). [Example](http://stackoverflow.com/a/3999597/4603670) – Barmak Shemirani Oct 29 '16 at 07:46
  • Can we have more code here? I don't see what you write to the socket or what you read. Please provide a minimal example. – Stargateur Oct 29 '16 at 07:47
  • You are casting and converting the pointer to the data, not the data itself. – Galik Oct 29 '16 at 07:48
  • You are not recreating the array. You are reinterpreting its address as an array of chars. From then on, writing to and reading from it is undefined behavior. This is why you should be using `static_cast` most of the time instead of C-style casts; it prevents such confusion. – asu Oct 29 '16 at 07:48
  • @Stargateur I updated the thread with the "full" function, but I don't send anything over the socket, I just convert it and then convert it back to learn how to do it properly before I would try to send it over the socket. – Trevin Corkery Oct 29 '16 at 07:51
  • @BarmakShemirani With your example, If I were to have a user type 盘, would converting it with WideCharToMultiByte end up breaking the 盘? – Trevin Corkery Oct 29 '16 at 07:53
  • @TrevinCorkery `L"盘"` is UTF16 `wchar_t`, it will be converted to UTF8 `u8"盘"` (stored in `char`) They are both the same text, just stored differently. Network functions expect UTF8 – Barmak Shemirani Oct 29 '16 at 07:55

2 Answers

7

UTF8 and UTF16 store text in completely different ways. Casting wchar_t* to char* does not convert anything; it only relabels the same bytes, much like casting a float* to a char*.

Use WideCharToMultiByte to convert UTF16 to UTF8 before sending it to a network function.

When receiving UTF8 from network functions, use MultiByteToWideChar to convert back to UTF16 so that it can be used in Windows functions.

Example:

#include <iostream>
#include <string>
#include <windows.h>

std::string get_utf8(const std::wstring &wstr)
{
    if (wstr.empty()) return std::string();
    //Pass the explicit length instead of -1 so the result does not
    //end up with an embedded null terminator.
    int sz = WideCharToMultiByte(CP_UTF8, 0, wstr.data(), (int)wstr.size(), 0, 0, 0, 0);
    std::string res(sz, 0);
    WideCharToMultiByte(CP_UTF8, 0, wstr.data(), (int)wstr.size(), &res[0], sz, 0, 0);
    return res;
}

std::wstring get_utf16(const std::string &str)
{
    if (str.empty()) return std::wstring();
    int sz = MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), 0, 0);
    std::wstring res(sz, 0);
    MultiByteToWideChar(CP_UTF8, 0, str.data(), (int)str.size(), &res[0], sz);
    return res;
}

int main()
{
    std::wstring greek = L"ελληνικά";

    std::string utf8 = get_utf8(greek);
    //use utf8.data() for network function...

    //convert utf8 back to utf16 so it can be displayed in Windows:
    std::wstring utf16 = get_utf16(utf8);
    MessageBoxW(0, utf16.c_str(), 0, 0);

    return 0;
}


Edit

Another example to show the difference between UTF16 and UTF8, this time by looking at the raw byte values of each encoding.

Note that for the Latin alphabet the UTF8 and ANSI bytes are exactly the same.

Also for the Latin alphabet, UTF8 and UTF16 are similar, except that UTF16 has an extra zero byte after each character.

For the Greek and Chinese alphabets there is a noticeable difference.

//(Windows example)
#include <stdio.h>
#include <string.h>
#include <wchar.h>

void printbytes_char(const char* ANSI_or_UTF8)
{
    const char *bytes = ANSI_or_UTF8;
    size_t len = strlen(bytes);
    for (size_t i = 0; i < len; i++)
        printf("%02X ", 0xFF & bytes[i]);
    printf("\n");
}

void printbytes_wchar_t(const wchar_t* UTF16)
{
    //Note: in Windows, wchar_t is 2 bytes
    const char *bytes = (const char*)UTF16;
    size_t len = wcslen(UTF16) * 2;
    for (size_t i = 0; i < len; i++)
        printf("%02X ", 0xFF & bytes[i]);
    printf("\n");
}

int main()
{
    printbytes_char("ABC");
    printbytes_char(u8"ABC");
    printbytes_wchar_t(L"ABC");

    printbytes_char(u8"ελληνικά");
    printbytes_wchar_t(L"ελληνικά");

    printbytes_char(u8"汉字/漢字");
    printbytes_wchar_t(L"汉字/漢字");
    return 0;
}

Output:

"ABC":
41 42 43 //ANSI
41 42 43 //UTF8
41 00 42 00 43 00 //UTF16 (this is little endian, bytes are swapped)

"ελληνικά"
CE B5 CE BB CE BB CE B7 CE BD CE B9 CE BA CE AC //UTF8
B5 03 BB 03 BB 03 B7 03 BD 03 B9 03 BA 03 AC 03 //UTF16

"汉字/漢字"
E6 B1 89 E5 AD 97 2F E6 BC A2 E5 AD 97 //UTF8
49 6C 57 5B 2F 00 22 6F 57 5B //UTF16
Barmak Shemirani
  • @Stargateur Yes, it's Windows specific. OP had tagged `winsock`. Unix based systems use UTF8 everywhere so they don't need this awkward conversion. – Barmak Shemirani Oct 29 '16 at 08:32
  • I think it works because MessageBoxW handles Unicode. Try with wprintf or std::cout in a console. [MessageBoxW (Unicode) and MessageBoxA (ANSI)](https://msdn.microsoft.com/en-us/library/windows/desktop/ms645505(v=vs.85).aspx) – Stargateur Oct 29 '16 at 09:51
  • @Stargateur Windows has only limited Unicode support in the console; that's yet another complication. For a Windows API like `MessageBox` there is UTF16 support (`MessageBoxW`) and ANSI support (`MessageBoxA`). UTF8 and ANSI are not the same; it just happens that for the Latin alphabet the characters are the same in ANSI and UTF8. See the updated answer. Windows cannot show UTF8 strings for non-Latin alphabets; you would need a Linux-based machine to test that. – Barmak Shemirani Oct 29 '16 at 23:11
0
    wchar_t Data[] = L"test";

    //Convert String to NetworkByte
    for (wchar_t &val : Data) {
        if (sizeof(val) == 4) {
            val = htonl(val);
        }
        else if (sizeof(val) == 2) {
            val = htons(val);
        }
        else {
            static_assert(sizeof(val) <= 4, "wchar_t is greater than 32 bits");
        }
    }

    //Convert String to HostByte
    for (wchar_t &val : Data) {
        if (sizeof(val) == 4) {
            val = ntohl(val);
        }
        else if (sizeof(val) == 2) {
            val = ntohs(val);

        }
        else {
            static_assert(sizeof(val) <= 4, "wchar_t is greater than 32 bits");
        }
    }
Stargateur