11

I'm trying to implement text support on Windows with the intention of also moving to a Linux platform later on. It would be ideal to support international languages in a uniform way, but that doesn't seem to be easily accomplished when considering the two platforms in question. I have spent a considerable amount of time reading up on Unicode, UTF-8 (and other encodings), wide chars and such, and here is what I have come to understand so far:

Unicode, as the standard, describes the set of characters that are mappable and the code points at which they occur. I refer to this as the "what": Unicode specifies what will be available.

UTF-8 (and other encodings) specify the "how": how each character will be represented in a binary format.

Now, on Windows, they opted for a UCS-2 encoding originally, but that failed to meet the requirements, so UTF-16 is what they have, which is also multi-char when necessary.

So here is the dilemma:

  1. Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?
  2. In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc.); however, on Windows, the character type for those functions is defined as unsigned char*. Given the fact that the _mbs series of functions is not a complete set (e.g. there is no _mbstol to convert a multi-byte string to a long), you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?
  3. In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).
Murrgon
    Not "multi-byte when necessary". It's just "multi-byte". You don't know whether it's "necessary" until you've started processing it. – Kerrek SB Oct 26 '12 at 15:52
  • Here's a [post of mine](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) on this subject; perhaps it's of interest to you. For (3), convert your data into UTF-32 (ideally stored in a `char32_t`), and then code points equal string elements. – Kerrek SB Oct 26 '12 at 15:54
  • And bear in mind that there are few legitimate reasons to iterate a Unicode string by code points, because a grapheme may be represented by multiple code points (each of which can be multiple code units in UTF-8 or UTF-16, but for many practical purposes that's the same problem twice). Normalization is one legitimate reason, encoding to UTF-8 is another, but these are things for which you can use a library anyway. – Steve Jessop Oct 26 '12 at 16:00
  • @SteveJessop I would imagine that attempting to do an insert in the middle of a string or a search and replace would require iteration to find the correct location and you'll obviously want to do it by code points, because if you simply go by char_type, you may end up splitting a multi-byte code point, which would be bad. – Murrgon Oct 26 '12 at 16:08
  • @Murrgon: Actually, that's not true. If you do a search and replace, as long as both the needle and haystack are valid UTF-8, you can replace the needle with any valid UTF-8 without causing problems (except you may want to renormalize). Doing an insert at an arbitrary location takes work but it requires the same work in UTF-32 since you have to find grapheme cluster boundaries, not code point boundaries. Code point boundaries are useless for *almost* anything that you ever want to do. – Dietrich Epp Oct 26 '12 at 16:12
  • @Murrgon: You don't want to do insertions based on code points either -- you want to do it by characters. If you insert a new character, you don't want it to go between a code point for an existing character and a code point for a diacritical that's supposed to combine with the existing character. – Jerry Coffin Oct 26 '12 at 16:12
  • @JerryCoffin: Actually, you don't want to do insertions by character. You want to do it by grapheme cluster. For example, you don't want to insert a character between `n` and `~` in `señor`. (This particular case is solved with composition, but not all such characters can be composed.) – Dietrich Epp Oct 26 '12 at 16:13
  • @DietrichEpp: That was my point (I was using "character" to refer to a complete grapheme cluster, though my wording was imprecise). – Jerry Coffin Oct 26 '12 at 16:15
  • @DietrichEpp How does one determine grapheme cluster boundaries then? – Murrgon Oct 26 '12 at 16:25
  • @Murrgon: Read [UAX #29](http://www.unicode.org/reports/tr29/). It's not simple. – Dietrich Epp Oct 26 '12 at 16:26
  • Recommended reading: http://utf8everywhere.org – Pavel Radzivilovsky Oct 27 '12 at 13:02

4 Answers

10

Just do UTF-8

There are lots of support libraries for UTF-8 on every platform, and some are cross-platform too. The UTF-16 APIs in Win32 are limited and inconsistent, as you've already noted, so it's better to keep everything in UTF-8 and convert to UTF-16 at the last moment. There are also some handy UTF-8 wrappers for the Windows API.
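As a sketch of the "convert at the last moment" idea, here is a minimal hand-rolled UTF-8 to UTF-16 converter. It assumes valid UTF-8 and does no error checking; in production you would use a library or MultiByteToWideChar instead.

```cpp
#include <cstdint>
#include <string>

// Minimal UTF-8 -> UTF-16 converter. Assumes well-formed UTF-8 input;
// real code should validate. On Windows, the resulting code units can be
// handed to the *W API functions.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    for (size_t i = 0; i < in.size(); ) {
        uint32_t cp = uint8_t(in[i]);
        // Number of continuation bytes, from the leading byte's high bits.
        size_t extra = cp < 0x80 ? 0 : cp < 0xE0 ? 1 : cp < 0xF0 ? 2 : 3;
        cp &= (0x7Fu >> extra);              // keep only the payload bits
        for (size_t k = 0; k < extra; ++k)
            cp = (cp << 6) | (uint8_t(in[++i]) & 0x3F);
        ++i;
        if (cp < 0x10000) {
            out.push_back(char16_t(cp));     // BMP: one code unit
        } else {                             // supplementary: surrogate pair
            cp -= 0x10000;
            out.push_back(char16_t(0xD800 | (cp >> 10)));
            out.push_back(char16_t(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note the surrogate-pair handling: code points above U+FFFF become two UTF-16 code units, which is exactly the "multi-char when necessary" aspect the question mentions.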

Also, at the application and document level, UTF-8 is increasingly accepted as the standard. Every text-handling application either accepts UTF-8 or, at worst, shows it as "ASCII with some dingbats", while only a few applications support UTF-16 documents; those that don't show them as "lots and lots of whitespace!"

Javier
  • 1
    I would add a quite good reference on why UTF-8 should be used everywhere: http://utf8everywhere.org/ – Anton Kochkov Dec 17 '16 at 11:02
  • "There are also some handy UTF-8 wrappings for the windows API." ... Such as? – jamesdlin Jun 30 '17 at 11:47
  • [Microsoft is making the Windows API increasingly UTF-8 capable.](https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) (Take note, however, your application’s manifest must be properly configured!) – Dúthomhas Jan 23 '23 at 06:30
8
  1. Correct. You will convert UTF-8 to UTF-16 for your Windows API calls.

  2. Most of the time you will use the regular string functions for UTF-8 -- strlen, strcpy (ick), snprintf, strtol. They will work fine with UTF-8 strings. Either use char * for UTF-8, or you will have to cast everything.

    Note that the underscore versions like _mbstowcs are not standard; the standard names have no underscore, like mbstowcs.

  3. It is difficult to come up with examples where you actually want to use operator[] on a Unicode string; my advice is to stay away from it. Likewise, iterating over a string by code point has surprisingly few uses:

    • If you are parsing a string (e.g., the string is C or JavaScript code, and maybe you want syntax highlighting) then you can do most of the work byte-by-byte and ignore the multibyte aspect.

    • If you are doing a search, you will also do this byte-by-byte (but remember to normalize first).

    • If you are looking for word breaks or grapheme cluster boundaries, you will want to use a library like ICU. The algorithm is not simple.

    • Finally, you can always convert a chunk of text to UTF-32 and work with it that way. I think this is the sanest option if you are implementing any of the Unicode algorithms like collation or breaking.

    See: C++ iterate or split UTF-8 string into array of symbols?
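The "convert a chunk of text to UTF-32 and work with it that way" option from the last bullet can be sketched like this (assuming well-formed UTF-8; a real implementation should validate the input):

```cpp
#include <cstdint>
#include <string>

// Decode a UTF-8 string into one char32_t per code point.
// Assumes well-formed input; invalid sequences are not detected.
std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (size_t i = 0; i < in.size(); ) {
        uint8_t b = uint8_t(in[i++]);
        uint32_t cp;
        size_t extra;
        if      (b < 0x80) { cp = b;        extra = 0; }  // 0xxxxxxx
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }  // 110xxxxx
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }  // 1110xxxx
        else               { cp = b & 0x07; extra = 3; }  // 11110xxx
        while (extra--)                      // fold in continuation bytes
            cp = (cp << 6) | (uint8_t(in[i++]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}
```

Once in UTF-32, every element is a full code point, so indexing and iteration behave the way the question expects -- though grapheme clusters can still span several elements.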

Dietrich Epp
2
  1. Windows internally only does UTF-16, so if you want to support international characters you are forced to convert to their widechar versions to use the OS calls accordingly. There doesn't seem to be any support for calling something like CreateFileA() with a multi-byte UTF-8 string and have it come out looking proper. Is this correct?

Yes, that's correct. The *A function variants interpret their string parameters according to the currently active code page (which is Windows-1252 on most computers in the US and Western Europe, but can often be another code page) and convert them to UTF-16. There is a UTF-8 code page, but AFAIK there isn't a way to set the active code page programmatically (there's GetACP to get the active code page, but no corresponding SetACP).

  2. In C, there are some multi-byte supporting functions (_mbscat, _mbscpy, etc), however, on windows, the character type is defined as unsigned char* for those functions. Given the fact that the _mbs series of functions is not a complete set (i.e. there is no _mbstol to convert a multi-byte string to a long, for example) you are forced to use some of the char* versions of the runtime functions, which leads to compiler problems because of the signed/unsigned type difference between those functions. Does anyone even use those? Do you just do a big pile of casting to get around the errors?

The mbs* family of functions is almost never used, in my experience. With the exception of mbstowcs, mbsrtowcs, and mbsinit, those functions are not standard C.

  3. In C++, std::string has iterators, but these are based on char_type, not on code points. So if I do a ++ on an std::string::iterator, I get the next char_type, not the next code point. Similarly, if you call std::string::operator[], you get a reference to a char_type, which has the great potential to not be a complete code point. So how does one iterate an std::string by code point? (C has the _mbsinc() function).

I think that mbrtowc(3) would be the best option here for decoding a single code point of a multibyte string.
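For illustration, a sketch of stepping through a string one code point at a time with mbrtowc. Keep in mind that mbrtowc decodes according to the current LC_CTYPE locale, so a UTF-8 locale must be active first; the locale name "C.UTF-8" varies by system and is an assumption here.

```cpp
#include <clocale>
#include <cstring>
#include <cwchar>

// Count the code points in a multibyte string by stepping through it with
// mbrtowc. Returns -1 on an invalid or truncated sequence. The decoding
// depends on the current LC_CTYPE locale.
int count_code_points(const char* s)
{
    mbstate_t state;
    std::memset(&state, 0, sizeof state);
    size_t len = std::strlen(s);
    int count = 0;
    while (len > 0) {
        wchar_t wc;
        size_t n = std::mbrtowc(&wc, s, len, &state);
        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                      // bad or incomplete sequence
        if (n == 0)                         // decoded an embedded NUL
            n = 1;
        s += n;
        len -= n;
        ++count;
    }
    return count;
}
```

The same stepping pattern (advance by the number of bytes mbrtowc consumed) gives you the "iterate by code point" behavior that _mbsinc provides on Windows.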

Overall, I think the best strategy for cross-platform Unicode compatibility is to do everything in UTF-8 internally, using single-byte characters. When you need to call a Windows API function, convert the string to UTF-16 and always call the *W variant. Most non-Windows platforms use UTF-8 already, which makes using them a snap.

Adam Rosenfield
0

In Windows, you can call WideCharToMultiByte and MultiByteToWideChar to convert between UTF-8 strings and UTF-16 strings (wstring on Windows). Because the Windows API does not use UTF-8, whenever you call a Windows API function that supports Unicode you have to convert the string to a wstring (the Windows version of Unicode, in UTF-16). And when you get output from Windows, you have to convert the UTF-16 back to UTF-8. Linux uses UTF-8 internally, so no such conversion is needed there. To make your code portable to Linux, stick to UTF-8 and provide something like the following for conversion:

#if (UNDERLYING_OS==OS_WINDOWS)

#include <windows.h>
#include <string>

using os_string = std::wstring;

std::string utf8_string_from_os_string(const os_string &os_str)
{
    int length = static_cast<int>(os_str.size());
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, os_str.c_str(), length, NULL, 0, NULL, NULL);
    std::string strTo(size_needed, 0);
    WideCharToMultiByte(CP_UTF8, 0, os_str.c_str(), length, &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

os_string utf8_string_to_os_string(const std::string &str)
{
    int length = static_cast<int>(str.size());
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, NULL, 0);
    os_string wstrTo(size_needed, 0);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), length, &wstrTo[0], size_needed);
    return wstrTo;
}

#else

// Other operating systems use UTF-8 directly, so no conversion is required
using os_string = std::string;
#define utf8_string_from_os_string(str)    str
#define utf8_string_to_os_string(str)    str

#endif

To iterate over UTF-8 strings, you need two fundamental functions: one to compute the number of bytes in a UTF-8 character sequence, and another to determine whether a byte is the leading byte of such a sequence. The following code provides a very efficient way to do this:

// Note: clz ("count leading zeros") is not standard C++; on GCC/Clang it can
// be defined as __builtin_clz, and on MSVC it can be built on _BitScanReverse.
inline size_t utf8CharBytes(char leading_ch)
{
    return (leading_ch & 0x80)==0 ? 1 : clz(~(uint32_t(uint8_t(leading_ch))<<24));
}

inline bool isUtf8LeadingByte(char ch)
{
    return (ch & 0xC0) != 0x80;   // continuation bytes have the form 10xxxxxx
}

Using these functions, it should not be difficult to implement your own iterators over UTF-8 strings: one forward iterator and one reverse iterator.
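To make that concrete, here is a sketch of forward and backward stepping built on those two helpers (reproduced here in a portable branch-based form rather than the clz version, and assuming well-formed UTF-8):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

inline bool isUtf8LeadingByte(char ch)
{
    return (ch & 0xC0) != 0x80;   // continuation bytes look like 10xxxxxx
}

inline size_t utf8CharBytes(char leading_ch)
{
    uint8_t b = uint8_t(leading_ch);
    if (b < 0x80) return 1;       // 0xxxxxxx
    if (b < 0xE0) return 2;       // 110xxxxx
    if (b < 0xF0) return 3;       // 1110xxxx
    return 4;                     // 11110xxx
}

// Byte index of the code point that follows the one starting at pos.
size_t nextCodePoint(const std::string& s, size_t pos)
{
    return pos + utf8CharBytes(s[pos]);
}

// Byte index of the code point that precedes pos: step back over
// continuation bytes until a leading byte is found.
size_t prevCodePoint(const std::string& s, size_t pos)
{
    do { --pos; } while (pos > 0 && !isUtf8LeadingByte(s[pos]));
    return pos;
}
```

Wrapping these two stepping functions in a class with operator++ and operator-- gives you the code-point iterator the question asks about.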