10

In my application I constantly have to convert strings between std::string and std::wstring due to different APIs (boost, Win32, ffmpeg, etc.). Especially with ffmpeg the strings end up going utf8->utf16->utf8->utf16, just to open a file.

Since UTF-8 is backwards compatible with ASCII I thought I would consistently store all my strings as UTF-8 in std::string and only convert to std::wstring when I have to call certain unusual functions.
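
For reference, the kind of boundary conversion I mean looks roughly like this (a sketch assuming C++11 `<codecvt>` and a Windows-style 16-bit wchar_t holding UTF-16):

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Boundary helpers: UTF-8 std::string <-> std::wstring.
// Assumes a C++11 <codecvt> implementation and a platform where wchar_t
// holds UTF-16 code units (i.e. Windows).
std::wstring widen(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.from_bytes(utf8);
}

std::string narrow(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> conv;
    return conv.to_bytes(wide);
}
```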

This worked kind of well: I implemented to_lower, to_upper and iequals for UTF-8. However, I then hit several dead ends, such as std::regex and regular string comparisons. To make this usable I would need to implement a custom ustring class based on std::string, with re-implementations of all the corresponding algorithms (including regex).

Basically my conclusion is that UTF-8 is not very good for general usage, and that the current std::string/std::wstring situation is a mess.

However, my question is: why are the default std::string and "" not simply changed to use UTF-8, especially as UTF-8 is backward compatible? Is there possibly some compiler flag which can do this? Of course the STL implementation would need to be adapted automatically.

I've looked at ICU, but it is not very compatible with APIs assuming basic_string, e.g. no begin/end/c_str, etc.

ronag
  • 49,529
  • 25
  • 126
  • 221
  • 4
    This breaks compatibility with the millions of existing programs which assume that the 8-bit encoding is something other than UTF-8. Some old programs assume that the 8-bit encoding is EBCDIC (which isn't even ASCII-compatible). More commonly, programs will [assume that it is Code Page 437](http://arep.med.harvard.edu/MapQuant/documentation/version-1.4/html/d3/d6/table__element_8h-source.html). In non-English speaking countries, the assumption is usually GB 2312 or Big5 or ShiftJIS or ISCII or 1256 or one of many other encodings tailored for specific languages. – Raymond Chen Dec 06 '11 at 13:30
  • @Raymond Chen: Thank you, I think that answers my main question. – ronag Dec 06 '11 at 13:33
  • Please do not reimplement `to_upper` and `to_lower`. These exist in the locale classes for a reason, and you cannot possibly in your lifetime learn enough about the different languages of the world to get these right. – Simon Richter Dec 06 '11 at 13:36
  • @Simon: The trivial solution of simply converting the input std:string into std::wstring, run to_upper/lower and then convert back to std::string is difficult to get wrong. – ronag Dec 06 '11 at 13:39
  • 1
    @SimonRichter: the locale classes cannot either. They work on a per character basis and some languages (such as greek) have non-bijective mapping so that `to_lower` is context dependent... – Matthieu M. Dec 06 '11 at 13:40
  • @Simon: As far as I understood locale will not work for utf8? – ronag Dec 06 '11 at 13:40
  • No changing your question. New questions require new posts. – Benjamin Lindley Dec 06 '11 at 13:42
  • 1
    The problem with this question is that `std::string` and `""` can be and often are UTF-8. Perhaps the asker wants this specified by the language and is really asking why it's not. – bames53 Dec 06 '11 at 13:52
  • 2
    Your premise is wrong. The value of a `""` string literal is determined by the **execution character set**, while the value of a `u8""` literal is determined by **UTF-8**. Those are two distinct, disjoint, disconnected problem domains. – Kerrek SB Dec 06 '11 at 13:59
  • @ronag, `locale` deals with abstract characters. If you have UTF-8 data, the difficulty is converting this input data to the abstract character set that happens to be Unicode most of the time but isn't guaranteed to be. – Simon Richter Dec 06 '11 at 14:23

3 Answers

8

The main issue is the conflation of in-memory representation and encoding.

None of the Unicode encodings is really amenable to text processing. Users will in general care about graphemes (what's on the screen), while the encodings are defined in terms of code points... and some graphemes are composed of several code points.

As such, when one asks what the 5th character of "Hélène" (a French first name) is, the question is quite confusing:

  • In terms of graphemes, the answer is n.
  • In terms of code points... it depends on the representation of é and è (they can be represented either as a single code point or as a pair using diacritics...)

Depending on the source of the question (an end-user in front of her screen or an encoding routine) the response is completely different.
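
A small sketch (assuming C++11 `u8` string literals) that makes the ambiguity concrete:

```cpp
#include <iostream>
#include <string>

int main()
{
    // "Hélène" with precomposed accents (U+00E9, U+00E8)...
    std::string nfc = u8"H\u00E9l\u00E8ne";   // 8 bytes, 6 code points
    // ...versus the same name with combining diacritics (U+0301, U+0300).
    std::string nfd = u8"He\u0301le\u0300ne"; // 10 bytes, 8 code points

    std::cout << nfc.size() << ' ' << nfd.size() << '\n'; // prints "8 10"
    // Both render as the same 6 graphemes; the "5th character" depends
    // entirely on which view (bytes, code points, graphemes) you ask about.
}
```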

Therefore, I think that the real question is Why are we speaking about encodings here?

Today it does not make sense, and we would need two "views": Graphemes and Code Points.

Unfortunately the std::string and std::wstring interfaces were inherited from a time when people thought that ASCII was sufficient, and the progress made since hasn't really solved the issue.

I don't even understand why the in-memory representation should be specified; it is an implementation detail. All a user should want is:

  • to be able to read/write in UTF-* and ASCII
  • to be able to work on graphemes
  • to be able to edit a grapheme (to manage the diacritics)

... who cares how it is represented? I thought that good software was built on encapsulation?

Well, C cares, and we want interoperability... so I guess it will be fixed when C is.
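
For illustration, here is a purely hypothetical sketch (not an existing API) of a code-point view that hides the byte representation; it assumes C++11 `char32_t` and well-formed UTF-8 input, and a real grapheme view would additionally need the Unicode segmentation data that libraries such as ICU ship:

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: callers see code points, not bytes.
// Assumes well-formed UTF-8 input.
std::vector<char32_t> code_points(const std::string& utf8)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf8.size(); )
    {
        unsigned char b = static_cast<unsigned char>(utf8[i]);
        std::size_t len;
        char32_t cp;
        if      (b < 0x80) { len = 1; cp = b; }          // 1-byte sequence (ASCII)
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; }   // 2-byte sequence
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }   // 3-byte sequence
        else               { len = 4; cp = b & 0x07; }   // 4-byte sequence
        for (std::size_t j = 1; j < len && i + j < utf8.size(); ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```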

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • I imagine some users really do care about what the bytes of their string are. Not everyone wants to iterate over graphemes. Some want to iterate over code points. Some want actual code units. – Nicol Bolas Jan 16 '12 at 18:53
  • @NicolBolas: I understand the need to work on codepoint or graphemes, but I do feel that encodings should be limited to the interfaces with the external world. It's like asking to work on a json or xml string directly: painful. – Matthieu M. Jan 17 '12 at 07:32
3

You cannot; the primary reason for this is named Microsoft. They decided not to support Unicode as UTF-8, so the support for UTF-8 under Windows is minimal.

Under Windows you cannot use UTF-8 as a codepage, but you can convert from or to UTF-8.
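
For example, a minimal sketch of that conversion route (error handling omitted); CP_UTF8 is accepted by MultiByteToWideChar/WideCharToMultiByte even though it cannot be used as the ANSI code page:

```cpp
#include <windows.h>
#include <string>

std::wstring utf8_to_utf16(const std::string& s)
{
    if (s.empty()) return std::wstring();
    // First call computes the required length, second call does the conversion.
    int len = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
    std::wstring out(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len);
    return out;
}

std::string utf16_to_utf8(const std::wstring& s)
{
    if (s.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0, NULL, NULL);
    std::string out(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, s.data(), (int)s.size(), &out[0], len, NULL, NULL);
    return out;
}
```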

sorin
  • 161,544
  • 178
  • 535
  • 806
  • But how can Microsoft affect how strings in C++ work? Doesn't Microsoft start first at the Win API? – ronag Dec 06 '11 at 13:24
  • 2
    Yeap. BTW, in Linux the support for UTF-8 is transparent. I even use strings like `µs` in source code, all works great. –  Dec 06 '11 at 13:24
  • 4
    UTF-8 wasn't around when Microsoft made Windows NT do Unicode. – Roger Lipscombe Dec 06 '11 at 13:27
  • @Roger - UTF-8 also makes things like strlen() potentially slow, UTF-16 (or at least wchar) is a lot easier – Martin Beckett Dec 06 '11 at 13:37
  • 1
    @MartinBeckett: `strlen()` counts the number of `char` elements in a string, not the number of characters. In a UTF-8 encoded string (or any other Ansi charset, for that matter), each `char` represents an encoded codeunit. You can use `strlen()` with UTF-8 strings, just know that it will count the number of UTF-8 codeunits, not characters. The same applies with UTF-16, BTW, since it is just another Unicode encoding. In a UTF-16 encoded string, each `wchar_t` represents an encoded codeunit. To count the number of actual characters, in any encoding, you have to decode the string to UTF-32 first. – Remy Lebeau Dec 06 '11 at 20:09
  • @Remy - yes that was my point, the length of the string, at least in terms of 'the number of characters wide it will be on screen' is hard in UTF-8 but trivial in wchar. So on a 1995 era box it wasn't practical to spend seconds repaging a document for the rare set of users who had UTF chars that didn't fit in wchar – Martin Beckett Dec 06 '11 at 21:05
  • 1
    @MartinBeckett: determining the length of the string onscreen is not any less trivial in UTF-16 than it is in UTF-8. They are just different encodings of the same Unicode data, and it is the Unicode data, and how fonts interpret it, that determines what the onscreen display will be like. Both have to be decoded either way. – Remy Lebeau Dec 06 '11 at 22:47
  • @remy - that's why Windows doesn't use UTF16 either, it uses wchar and the subset of utf16 that fits into 16bit – Martin Beckett Dec 07 '11 at 03:20
  • 2
    @MartinBeckett: yes, Windows does use UTF-16, and has since Windows 2000 (MSDN says as much). A single `wchar_t` can hold a UTF-16/UCS2 codeunit within the BMP only, but a string of multiple `wchar_t` can hold fully encoded UTF-16 codeunits and surrogate pairs that can encode the entire Unicode charset. The Win32 API supports and expects that. – Remy Lebeau Dec 07 '11 at 04:57
3

There are two snags to using UTF-8 on Windows.

  1. You cannot tell how many bytes a string will occupy - it depends on which characters are present, since some characters take 1 byte, some take 2, some take 3, and some take 4 (see the sketch after this list).

  2. The Windows API uses UTF-16. Since most Windows programs make numerous calls to the Windows API, there is quite an overhead converting back and forth. (Note that you can do a "non-Unicode" build, which looks like it gives you a char-based Windows API, but all that is happening is that the conversion back and forth on each call is hidden.)
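
To make point 1 concrete, a small sketch (assuming C++11 `u8` string literals):

```cpp
#include <iostream>
#include <string>

int main()
{
    // Bytes per character in UTF-8:
    std::cout << std::string(u8"A").size()          << '\n'; // 1 (U+0041)
    std::cout << std::string(u8"\u00E9").size()     << '\n'; // 2 (U+00E9, e with acute)
    std::cout << std::string(u8"\u20AC").size()     << '\n'; // 3 (U+20AC, euro sign)
    std::cout << std::string(u8"\U0001F600").size() << '\n'; // 4 (U+1F600, outside the BMP)
}
```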

The big snag with UTF-16 is that the binary representation of a string depends on the byte order of a word on the particular hardware the program is running on. This does not matter in most cases, except when strings are transmitted between computers, where you cannot be sure that the other computer uses the same byte order.

So what to do? I use UTF-16 everywhere 'inside' all my programs. When string data has to be stored in a file, or transmitted over a socket, I first convert it to UTF-8.

This means that 95% of my code runs simply and most efficiently, and all the messy conversions between UTF-8 and UTF-16 can be isolated to the routines responsible for I/O.

Remy Lebeau
  • 555,201
  • 31
  • 458
  • 770
ravenspoint
  • 19,093
  • 6
  • 57
  • 103