
I am trying to convert a Unicode string to a UTF-8 string:

#include <stdio.h>
#include <string>
#include <atlconv.h>
#include <atlstr.h>

using namespace std;

CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    if (uni.IsEmpty()) return "";
    CStringA utf8;
    int cc = 0;

    if ((cc = WideCharToMultiByte(CP_UTF8, 0, uni, -1, NULL, 0, 0, 0) - 1) > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf) WideCharToMultiByte(CP_UTF8, 0, uni, -1, buf, cc, 0, 0);
        utf8.ReleaseBuffer();
    }
    return utf8;
}

int main(void)
{
    string u8str = ConvertUnicodeToUTF8(L"gökhan");

    printf("%d\n", u8str.size());

    return 0;
}

My question is: should the value returned by u8str.size() be 6? It prints 7 now!

codeator
    First of all, Unicode isn't an encoding scheme, it's a whole bunch of them, UTF-8, UTF-16. Microsoft is probably mostly to blame for this as for them Unicode is UTF-16. But they also call ASCII ANSI, which it isn't. – Zebrafish Nov 26 '16 at 07:00
    @Titone They don't call ASCII ANSI. ASCII is a 7-bit encoding and what MS call ANSI is an 8-bit encoding. In Microsoft's defence, when they introduced Unicode the world looked very different. UTF-8, UTF-16 and UTF-32 did not exist. It was UCS-2 back then. MS get slated for using UTF-16 rather than UTF-8, but there are perfectly reasonable historical reasons for it. – David Heffernan Nov 26 '16 at 07:30

3 Answers


7 is correct. The non-ASCII character ö is encoded with two bytes in UTF-8.

David Heffernan
    The important issue here is that most `strlen`-style functions, including `std::string::size()`, return the count of *code units*, rather than the number of *code points*. A single code point may be represented using multiple code units. ö is a single code point, but in an encoding with single-byte code units like UTF-8, it is actually encoded using two code units. If I had a really good link that explained this difference, I'd edit it into the answer, but I don't know of one. Wikipedia works, but takes some legwork to understand the distinction. – Cody Gray - on strike Nov 26 '16 at 08:55
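
To see the code unit / code point distinction concretely, here is a minimal standalone sketch (not part of either answer; the CountUtf8CodePoints helper name is mine). It assumes the input is well-formed UTF-8:

#include <cstdio>
#include <string>

// Counts code points in a well-formed UTF-8 string by skipping
// continuation bytes (those of the form 10xxxxxx).
static size_t CountUtf8CodePoints(const std::string& s)
{
    size_t points = 0;
    for (unsigned char ch : s)
    {
        if ((ch & 0xC0) != 0x80) // not a continuation byte, so it starts a code point
            ++points;
    }
    return points;
}

int main()
{
    const std::string u8str = "g\xC3\xB6khan"; // "gökhan" encoded as UTF-8

    printf("code units:  %zu\n", u8str.size());               // prints 7
    printf("code points: %zu\n", CountUtf8CodePoints(u8str)); // prints 6

    return 0;
}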

By definition, "multi-byte" means that each Unicode entity may occupy up to 6 bytes; see here: How many bytes does one Unicode character take?

Further reading: http://www.joelonsoftware.com/articles/Unicode.html

Anton Shmelev
    5-byte and 6-byte variants of UTF-8 are not used in practice, as such high codepoints are not defined by Unicode and are not compatible with UTF-16. UTF-8 is limited to 4 bytes by [RFC 3629](https://tools.ietf.org/html/rfc3629) so it can encode only the range of codepoints supported by UTF-16. – Remy Lebeau Nov 26 '16 at 23:08
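
To make that 4-byte limit concrete, here is a small illustrative sketch (the Utf8ByteCount helper name is mine, not from the answer or comment) that maps a code point to the number of bytes UTF-8 uses for it, per RFC 3629:

#include <cstdio>

// Returns how many bytes UTF-8 needs for a given code point,
// or 0 for surrogates and values above the Unicode range (U+10FFFF).
static int Utf8ByteCount(unsigned long cp)
{
    if (cp <= 0x7F)                   return 1; // ASCII, e.g. 'g'
    if (cp <= 0x7FF)                  return 2; // e.g. ö (U+00F6)
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; // surrogates are not encodable
    if (cp <= 0xFFFF)                 return 3; // e.g. € (U+20AC)
    if (cp <= 0x10FFFF)               return 4; // e.g. U+1F600, the RFC 3629 upper limit
    return 0;
}

int main()
{
    printf("%d %d %d %d\n",
           Utf8ByteCount(0x67),      // 1
           Utf8ByteCount(0xF6),      // 2
           Utf8ByteCount(0x20AC),    // 3
           Utf8ByteCount(0x1F600));  // 4
    return 0;
}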

A Unicode codepoint uses 2 or 4 bytes in UTF-16, but uses 1-4 bytes in UTF-8, depending on its value. It is possible for a 2-byte codepoint value in UTF-16 to use 3-4 bytes in UTF-8, thus a UTF-8 string may use more bytes than the corresponding UTF-16 string. UTF-8 tends to be more compact for Latin/Western languages, but UTF-16 tends to be more compact for Eastern Asian languages.

std::(w)string::size() and CStringT::GetLength() count the number of encoded codeunits, not the number of codepoints. In your example, "gökhan" is encoded as:

UTF-16LE: 0x0067 0x00f6 0x006b 0x0068 0x0061 0x006e
UTF-16BE: 0x6700 0xf600 0x6b00 0x6800 0x6100 0x6e00
UTF-8: 0x67 0xc3 0xb6 0x6b 0x68 0x61 0x6e

Notice that ö is encoded using 1 codeunit in UTF-16 (LE: 0x00f6, BE: 0xf600) but uses 2 codeunits in UTF-8 (0xc3 0xb6). That is why your UTF-8 string has a size of 7 instead of 6.
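
One way to verify those byte values is to dump the converted string in hex; a short sketch (not part of the original answer), assuming the ConvertUnicodeToUTF8 helper from the question is available:

#include <stdio.h>
#include <atlstr.h>

int main()
{
    // Uses ConvertUnicodeToUTF8() from the question above.
    CStringA utf8 = ConvertUnicodeToUTF8(L"gökhan");

    // With L"gökhan" this prints: 0x67 0xc3 0xb6 0x6b 0x68 0x61 0x6e
    // 7 code units for 6 code points; ö takes two bytes (0xc3 0xb6).
    for (int i = 0; i < utf8.GetLength(); ++i)
        printf("0x%02x ", (unsigned char)utf8[i]);
    printf("\n");

    return 0;
}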

That being said, when calling WideCharToMultiByte() and MultiByteToWideChar() with -1 as the source length, the function has to manually count the characters, and the return value will include room for a null terminator when the destination pointer is NULL. You don't need that extra space when using CStringA/W, std::(w)string, etc, and you don't need the overhead of counting characters when the source already knows its length. You should always specify the actual source length when you know it, eg:

CStringA ConvertUnicodeToUTF8(const CStringW& uni)
{
    CStringA utf8;

    // First call: ask how many UTF-8 bytes are needed. No room for a null
    // terminator is included, because an explicit source length is passed.
    int cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), NULL, 0, 0, 0);
    if (cc > 0)
    {
        char *buf = utf8.GetBuffer(cc);
        if (buf)
        {
            // Second call: convert into the buffer, then commit the real length.
            cc = WideCharToMultiByte(CP_UTF8, 0, uni, uni.GetLength(), buf, cc, 0, 0);
            utf8.ReleaseBuffer(cc);
        }
    }

    return utf8;
}
Remy Lebeau
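
For completeness (this part is not from the answer above), the reverse conversion can follow the same explicit-length pattern with MultiByteToWideChar; a sketch, with the ConvertUTF8ToUnicode name being mine:

CStringW ConvertUTF8ToUnicode(const CStringA& utf8)
{
    CStringW uni;

    // First call: ask how many UTF-16 code units the result needs.
    int cc = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8.GetLength(), NULL, 0);
    if (cc > 0)
    {
        wchar_t *buf = uni.GetBuffer(cc);
        if (buf)
        {
            // Second call: convert into the buffer and commit the real length.
            cc = MultiByteToWideChar(CP_UTF8, 0, utf8, utf8.GetLength(), buf, cc);
            uni.ReleaseBuffer(cc);
        }
    }

    return uni;
}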