Handling UTF-8 in C++

Question

To find out if C++ is the right language for a project of mine, I want to test its UTF-8 capabilities. Following some references, I built this example:

#include <string>
#include <iostream>

using namespace std;

int main() {
    wstring str;
    while(getline(wcin, str)) {
        wcout << str << endl;
        if(str.empty()) break;
    }

    return 0;
}

But when I type in a UTF-8 character, it misbehaves:

$ > ./utf8 
Hello
Hello
für
f
$ >

Not only does it fail to print the ü, it also quits immediately. gdb told me there was no crash, just a normal exit, yet I find that hard to believe.

Linux, actually. If it works on windows, too, that is kind of a bonus. — Lanbo, Dec 14 '11 at 23:39
Doesn't necessarily follow. Anyway, it works with normal `string`, `cin`, `cout`, not with the `w...` versions here, I suspect they want UTF-32 (or 16?). — Daniel Fischer, Dec 14 '11 at 23:54
Some of my previous questions on the topic: [#1](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability), [#2](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x), [#3](http://stackoverflow.com/questions/7562609/what-does-cuchar-provide-and-where-is-it-documented) — Kerrek SB, Dec 14 '11 at 23:55

score 10 · Accepted Answer · edited May 23 '17 at 11:46

Don't use wstring on Linux.

std::wstring VS std::string

Take a look at the first answer there. I'm sure it answers your question.

When I should use std::wstring over std::string?

On Linux? Almost never (§).

On Windows? Almost always (§).
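
As a rough sketch of what this means for the program in the question: the same loop written with plain std::string and narrow streams just passes the UTF-8 bytes through untouched. The names echo_lines and echo_text are made up for this illustration, and it assumes the terminal (or whatever produces the input) already speaks UTF-8.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Copies input to output line by line and stops after the first empty line.
// No decoding happens: the bytes of each line are written back unchanged,
// so multi-byte UTF-8 sequences such as "ü" (0xC3 0xBC) survive intact.
void echo_lines(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        out << line << '\n';
        if (line.empty()) break;
    }
}

// Hypothetical convenience wrapper: runs echo_lines over an in-memory string,
// which is handy for trying the loop without a terminal.
std::string echo_text(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    echo_lines(in, out);
    return out.str();
}
```

In a real program you would call echo_lines(std::cin, std::cout) from main. Keep in mind that std::string::size() then counts bytes, not characters.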

answered Dec 14 '11 at 23:55 by robert petranovic · edited May 23 '17 at 11:46 by Community

+1 : Take a look at this answer. I'm sure it links to an answer to your question. – Klaim Dec 14 '11 at 23:58
In the `boost::spirit` comments on UTF-8 they're always talking about using `wchar_t`. – Lanbo Dec 15 '11 at 00:01
@Scán: I'd guess they use `wchar_t` all the time for code points, used when translating UTF8 to and from anything. `wchar_t` is not a good character for UTF8 itself though. – Mooing Duck Dec 15 '11 at 00:04

score 10 · Answer 2 · edited Oct 06 '16 at 12:54

The language itself has nothing to do with Unicode or any other character encoding; that is tied to the operating system. Windows uses UTF-16 for its Unicode support, which means using wide characters (16 bits wide): wchar_t or std::wstring. The Unicode variants of the Win API functions that operate on strings take wide-character input.

Unix-based systems such as Mac OS X or Linux, on the other hand, use UTF-8. Of course, an encoding is only a matter of how you interpret the bytes in an array, so you can just as well store a UTF-16 string in a plain C array or a std::string container. This is why you do not see any wstrings in cross-platform code; instead, all strings are handled as UTF-8 and re-encoded to UTF-16 when necessary (on Windows).

There are several ways to handle this somewhat confusing matter. Personally, I do it as described above: I use UTF-8 strictly throughout the application, re-encode strings when interacting with the Windows API, and use them directly on Mac OS X. For the re-encoding on Windows I use these great conversion helpers:

C++ UTF-8 Conversion Helpers (on MSDN, available under the Apache License, Version 2.0).

You can also use the cross-platform Qt string class (QString), which provides conversion functions between UTF-8, UTF-16 and other encodings (ANSI, Latin...).

So the answer above is true: on Unix always use UTF-8 (std::string, char), on Windows UTF-16 (std::wstring, wchar_t).
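
To make the "re-encode to UTF-16 when necessary" step a bit more concrete, here is a minimal hand-written sketch of such a conversion. It is not the MSDN helper set mentioned above and not Qt's implementation; utf8_to_utf16 is a made-up name, and the error handling (throwing std::invalid_argument on malformed input) is just one possible choice.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into code points and re-encodes
// them as UTF-16 code units. Throws std::invalid_argument on bad input.
std::u16string utf8_to_utf16(const std::string& in) {
    // Smallest code point that legitimately needs a sequence of each length;
    // anything below it is an overlong encoding.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += len;
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring before calling the API. In production code a tested library conversion is probably the safer route.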

So what do you propose I should do when I want to make a language compiler/interpreter that treats everything as UTF-8 on both systems? – Lanbo, Dec 15 '11 at 09:57
Well, there is no simple answer and no "ultimate" solution. It depends on what compilers, IDEs and APIs you use. I would recommend using a cross-platform application framework, ideally Qt by Nokia - http://qt.nokia.com. It is completely free for open source projects and even for commercial ones, provided you comply with the GNU Lesser General Public License (LGPL). – vitakot, Dec 16 '11 at 11:06

score 4 · Answer 3 · edited May 17 '16 at 17:06

Remember that at startup of the main program, the "C" locale is selected by default. You probably don't want this if you handle UTF-8. Calling setlocale(LC_CTYPE, "") replaces this default with whatever is defined in the environment (presumably a UTF-8 locale).
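
A small sketch of what that looks like in code, under the assumption that the environment really names a UTF-8 locale. Both function names are made up for this illustration: use_environment_locale wraps the setlocale call described above, and to_wide shows the effect, since the standard multibyte conversion (std::mbsrtowcs) follows whatever LC_CTYPE is currently active.

```cpp
#include <clocale>
#include <cwchar>
#include <string>

// Replaces the default "C" character handling with the locale named by the
// environment (LANG, LC_ALL, ...). Returns false if that locale is unknown.
bool use_environment_locale() {
    return std::setlocale(LC_CTYPE, "") != nullptr;
}

// Hypothetical helper: converts a multibyte string to wide characters using
// the current LC_CTYPE locale. Returns false if the bytes are not valid in
// that locale (for example, UTF-8 input while a non-UTF-8 locale is active).
bool to_wide(const std::string& in, std::wstring& out) {
    std::mbstate_t state = std::mbstate_t();
    const char* src = in.c_str();
    // With a null destination, mbsrtowcs only reports the required length.
    const std::size_t n = std::mbsrtowcs(nullptr, &src, 0, &state);
    if (n == static_cast<std::size_t>(-1)) return false;
    out.assign(n, L'\0');
    if (n == 0) return true;
    state = std::mbstate_t();
    src = in.c_str();
    std::mbsrtowcs(&out[0], &src, n, &state);
    return true;
}
```

In the program from the question, the call to use_environment_locale() (or a bare setlocale) would go at the top of main, before the first read from wcin.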

answered Feb 26 '12 at 11:38 by nick · edited May 17 '16 at 17:06 by Toby Speight

Yes! Contrary to some other answers, it is perfectly OK to use `wchar_t` on Linux. You absolutely have to use the right locale though. – n. m. could be an AI Feb 26 '12 at 12:26
