
I have a string that is filled with data from another program, and this data may or may not be UTF-8 encoded. If it is not, I can encode it to UTF-8, but what is the best way to detect UTF-8 in C++? I saw this variant https://stackoverflow.com/questions/... but there are comments saying that this solution does not give 100% detection. If I re-encode a string that already contains UTF-8 data, I end up writing wrong text to the database.

So can I just use this UTF-8 detection:

bool is_utf8(const char * string)
{
    if(!string)
        return 0;

    const unsigned char * bytes = (const unsigned char *)string;
    while(*bytes)
    {
        if( (// ASCII
             // use bytes[0] <= 0x7F to allow ASCII control characters
                bytes[0] == 0x09 ||
                bytes[0] == 0x0A ||
                bytes[0] == 0x0D ||
                (0x20 <= bytes[0] && bytes[0] <= 0x7E)
            )
        ) {
            bytes += 1;
            continue;
        }

        if( (// non-overlong 2-byte
                (0xC2 <= bytes[0] && bytes[0] <= 0xDF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF)
            )
        ) {
            bytes += 2;
            continue;
        }

        if( (// excluding overlongs
                bytes[0] == 0xE0 &&
                (0xA0 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// straight 3-byte
                ((0xE1 <= bytes[0] && bytes[0] <= 0xEC) ||
                    bytes[0] == 0xEE ||
                    bytes[0] == 0xEF) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            ) ||
            (// excluding surrogates
                bytes[0] == 0xED &&
                (0x80 <= bytes[1] && bytes[1] <= 0x9F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF)
            )
        ) {
            bytes += 3;
            continue;
        }

        if( (// planes 1-3
                bytes[0] == 0xF0 &&
                (0x90 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// planes 4-15
                (0xF1 <= bytes[0] && bytes[0] <= 0xF3) &&
                (0x80 <= bytes[1] && bytes[1] <= 0xBF) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            ) ||
            (// plane 16
                bytes[0] == 0xF4 &&
                (0x80 <= bytes[1] && bytes[1] <= 0x8F) &&
                (0x80 <= bytes[2] && bytes[2] <= 0xBF) &&
                (0x80 <= bytes[3] && bytes[3] <= 0xBF)
            )
        ) {
            bytes += 4;
            continue;
        }

        return 0;
    }

    return 1;
}

And this code for encoding to UTF-8 if the detection returns false:

std::string text = EscReason;
if (!is_utf8(text.c_str()))
{
    int size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), 0, 0);
    std::wstring utf16_str(size, L'\0');

    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, text.c_str(),
        text.length(), &utf16_str[0], size);

    int utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), 0, 0, 0, 0);

    std::string utf8_str(utf8_size, '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(),
        utf16_str.length(), &utf8_str[0], utf8_size, 0, 0);

    text = utf8_str;
}

Or is the code above not done properly? Also, I am doing this on Windows 7. What about Ubuntu? Does this approach work there?

ratojakuf

2 Answers


Comparing whole byte values is not the correct way to detect UTF-8. You have to analyze the actual bit patterns of each byte. UTF-8 uses a very distinct bit pattern that no other encoding uses. Try something more like this instead:

bool is_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            bytes += 1;
        }
    }

    return true;
}

Now, this does not take into account illegal UTF-8 sequences, such as overlong encodings, UTF-16 surrogates, and codepoints above U+10FFFF. If you want to make sure the UTF-8 is both valid and correct, you would need something more like this:

bool is_valid_utf8(const char * string)
{
    if (!string)
        return true;

    const unsigned char * bytes = (const unsigned char *)string;
    unsigned int cp;
    int num;

    while (*bytes != 0x00)
    {
        if ((*bytes & 0x80) == 0x00)
        {
            // U+0000 to U+007F 
            cp = (*bytes & 0x7F);
            num = 1;
        }
        else if ((*bytes & 0xE0) == 0xC0)
        {
            // U+0080 to U+07FF 
            cp = (*bytes & 0x1F);
            num = 2;
        }
        else if ((*bytes & 0xF0) == 0xE0)
        {
            // U+0800 to U+FFFF 
            cp = (*bytes & 0x0F);
            num = 3;
        }
        else if ((*bytes & 0xF8) == 0xF0)
        {
            // U+10000 to U+10FFFF 
            cp = (*bytes & 0x07);
            num = 4;
        }
        else
            return false;

        bytes += 1;
        for (int i = 1; i < num; ++i)
        {
            if ((*bytes & 0xC0) != 0x80)
                return false;
            cp = (cp << 6) | (*bytes & 0x3F);
            bytes += 1;
        }

        if ((cp > 0x10FFFF) ||
            ((cp >= 0xD800) && (cp <= 0xDFFF)) ||
            ((cp <= 0x007F) && (num != 1)) ||
            ((cp >= 0x0080) && (cp <= 0x07FF) && (num != 2)) ||
            ((cp >= 0x0800) && (cp <= 0xFFFF) && (num != 3)) ||
            ((cp >= 0x10000) && (cp <= 0x1FFFFF) && (num != 4)))
            return false;
    }

    return true;
}
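For illustration, a small test driver might look like this (assuming both functions above are in the same translation unit; the byte sequences are made-up examples: a valid two-byte character, an overlong encoding of '/', and an encoded UTF-16 surrogate):

#include <cstdio>

int main()
{
    const char valid[]     = "caf\xC3\xA9";  // "café": 0xC3 0xA9 is valid UTF-8 for U+00E9
    const char overlong[]  = "\xC0\xAF";     // overlong encoding of '/'
    const char surrogate[] = "\xED\xA0\x80"; // encoded UTF-16 surrogate U+D800

    // is_utf8() only checks the structural bit patterns, so it accepts all three.
    std::printf("%d %d %d\n",
        is_utf8(valid), is_utf8(overlong), is_utf8(surrogate));             // 1 1 1

    // is_valid_utf8() also rejects the overlong and surrogate sequences.
    std::printf("%d %d %d\n",
        is_valid_utf8(valid), is_valid_utf8(overlong), is_valid_utf8(surrogate)); // 1 0 0

    return 0;
}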
Remy Lebeau
  • How does `(*bytes & 0xE0) == 0xC0` give the range 0x80 to 0x7FF? It should give the range 0xC0 to 0xDF. – ahmed allam Feb 20 '20 at 10:19
  • @ahmedallam no, what I wrote is correct. Look at the [bit pattern table](https://en.wikipedia.org/wiki/UTF-8) described on Wikipedia for UTF-8. Unicode codepoints U+0080 to U+07FF (not bytes 0xC0 to 0xDF) are encoded in 2 bytes using the bit pattern `110xxxxx 10xxxxxx`. 0xE0 is bits `11100000` and 0xC0 is bits `11000000`. So, `if ((*bytes & 0xE0) == 0xC0)` is checking if the high 3 bits of the 1st byte are `110` before `(*bytes & 0x1F)` grabs the low 5 bits. Then later, `((*bytes & 0xC0) != 0x80)` checks if the high 2 bits of the 2nd byte are `10` before `(*bytes & 0x3F)` grabs the low 6 bits. – Remy Lebeau Feb 20 '20 at 17:03
  • @ahmedallam seems you need to brush up on how bits, bit masks, and bitwise operators work. – Remy Lebeau Feb 20 '20 at 17:04
  • @RemyLebeau Is this exception/thread safe ? (noob question) – Mecanik Dec 11 '20 at 09:02
  • @NorbertBoros as long as the `string` parameter is pointing at a valid C-style null-terminated string, and that memory is not modified or freed by another thread while the function is running, then yes, the function is safe. Otherwise, its behavior is undefined. – Remy Lebeau Dec 11 '20 at 15:26
  • "illegal UTF-8 sequences, such as overlong encodings" - think illegal prefix UTF-8 11111xxxb is caught with your num-decoding. – Sam Ginrich Jan 01 '23 at 12:19
  • @SamGinrich yes, it is. The last `else if` in the `num` counter is checking for prefix `11110xxx`, which `11111xxx` will fail so the final `else` is reached to call `return false` – Remy Lebeau Jan 01 '23 at 18:43

You probably don't understand UTF-8 and the alternatives. There are only 256 possible values for a byte. That's not a lot, given the number of characters. As a result, many byte sequences are both valid UTF-8 strings and valid strings in other encodings.

In fact, every ASCII string is intentionally a valid UTF-8 string with essentially the same meaning. Your code would return true for `is_utf8("Hello")`.

Many other non-UTF-8, non-ASCII strings also share byte sequences with valid UTF-8 strings. And there is simply no way to convert a non-UTF-8 string to UTF-8 without knowing exactly which non-UTF-8 encoding it is. Even Latin-1 and Latin-2 are already quite different. CP_ACP is even worse than Latin-1: it isn't even the same on every machine.
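As a made-up illustration, reusing the question's `is_utf8()`: the two bytes 0xC3 0xA9 read as "é" in UTF-8 and as "Ã©" in Latin-1, and both readings are perfectly valid, so no detector can tell from the bytes alone which one the producer meant.

#include <cstdio>

int main()
{
    // One byte sequence, two equally valid interpretations:
    //   UTF-8   : U+00E9          "é"
    //   Latin-1 : U+00C3 U+00A9   "Ã©"
    const char bytes[] = "\xC3\xA9";

    // A byte-level check can only report that the bytes are *consistent*
    // with UTF-8, not that they *are* UTF-8.
    std::printf("%d\n", is_utf8(bytes)); // prints 1
    return 0;
}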

Your text must go into the database as UTF-8. Thus, if it isn't yet UTF-8, it must be converted, and you must know the exact source encoding. There is no magical escape.

On Linux, iconv is the usual method to convert between two encodings.
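A minimal sketch of that approach, with a hypothetical helper latin1_to_utf8() and assuming the source encoding is known to be ISO-8859-1 (substitute whatever encoding your producer actually uses):

#include <iconv.h>
#include <stdexcept>
#include <string>

// Convert a Latin-1 string to UTF-8 using iconv. The source encoding has
// to be known in advance; iconv cannot guess it for you.
std::string latin1_to_utf8(const std::string & input)
{
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output(input.size() * 2, '\0'); // Latin-1 -> UTF-8 needs at most 2 bytes per char
    char * in_ptr   = const_cast<char *>(input.data());
    size_t in_left  = input.size();
    char * out_ptr  = &output[0];
    size_t out_left = output.size();

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed");

    output.resize(output.size() - out_left);
    return output;
}

With glibc no extra library is needed; on other systems you may have to link with -liconv.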

MSalters
  • The question does not do the technical breakdown, but I think it is understood that a "UTF-8 stream", by its grammar, is a subclass of a "byte stream" and is independent of how the 7-bit ASCII character set is extended. Only the difference class is detectable. – Sam Ginrich May 04 '23 at 05:31