
Most answers and questions here on SO put L before any UTF-8 string. I found no explanation of what it is; in the source code the constant is, according to my IDE, defined in winnt.h.

This is how I use it, without knowing what it is:

std::wcout << L"\"Přetečení zásobníku\" is Stack overflow in Czech.";

Obviously, constant concatenation cannot be applied to variables:

void printUTF8(const char* str) {
    // Does not make the slightest bit of sense
    std::wcout << L str;
}

So what is it and how to add it to dynamic strings?

Tomáš Zato
  • L is a 16 bit designator (mostly, it could be anything in theory) and a UTF-8 string is not 16-bit. – carveone Jun 19 '14 at 12:54
  • Oh, well, I might be using UTF-16 in fact... – Tomáš Zato Jun 19 '14 at 12:57
  • There is another issue here which is the Console. I'll update my post but could you tell us what you are getting on the console with your example? – carveone Jun 19 '14 at 13:08
  • Currently my code works - that's because I'm using the function that you've suggested already. However I'll reconsider using UTF-8. – Tomáš Zato Jun 19 '14 at 13:16
  • I may have answered a question that wasn't asked then! If you have a UTF-8 string you can convert it to something that wcout wants with the MultiByteToWideChar() function. That would make a normal string into an "L" string. – carveone Jun 19 '14 at 13:22

3 Answers


L is an indication to the C compiler that the string is composed of "wide characters". In Windows, these would be UTF-16 - each character that you put in the string is 16 bits, or two bytes, wide:

L"This is a wide string"

In contrast, a UTF-8 string is always a string composed of bytes. ASCII characters (A-Z 0-9 etc) are encoded the way they have always been - in the range 0x00 to 0x7F (or 0 to 127). International characters (like ř) are encoded using multiple bytes in the range 0x80 to 0xFF - there is a very good explanation on wikipedia. The advantage is that it can be represented using ordinary C strings.

"This is an ordinary string, but also a UTF-8 string"

"This is a C cedilla in UTF-8: \xc3\x87"

However, if you are typing these international characters into actual code, your editor needs to know that you are typing UTF-8 so it can encode the characters correctly, like the C cedilla above. Then the string will be passed correctly to your function.

In your case, your comment indicates that you are using UTF-16. In which case there are two other issues:

  • The console will, by default, not output Unicode characters correctly. You need to change the font to a TrueType font such as Lucida Console

  • You also need to change the output mode to a Unicode UTF-16 one. You can do this with:

    _setmode(_fileno(stdout), _O_U16TEXT);

Code example:

#include <iostream>
#include <stdio.h>      // _fileno
#include <io.h>         // _setmode
#include <fcntl.h>      // _O_U16TEXT

int wmain(int argc, wchar_t* argv[])
{
    // Switch stdout to UTF-16 mode so the wide string reaches the console intact
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Přetečení zásobníku is Stack overflow in Czech." << std::endl;
}
carveone
  • 889
  • 9
  • 16
  • Why is the `_setmode` required when you're using `wcout`? That doesn't seem right. – Mark Ransom Jun 19 '14 at 15:00
  • I believe (but not sure) that otherwise the console will assume ANSI and convert the UTF-16 to garbage! – carveone Jun 19 '14 at 15:24
  • @MarkRansom: the purpose of the wide streams is to convert to/from external byte-oriented encoding. well except that as i see it the design with 8 standard stream objects mapped to 3 OS byte streams, where there's no way to check whether any particular use somewhere is *invalid* (C level "orientation", wide or narrow), sucks. ;-) – Cheers and hth. - Alf Jun 19 '14 at 15:32

L"" is a WIDE string. That is to say, it's a a wchar_t[1]. UTF-8 strings can't be wide, since they are multi-byte (variable length). VC++ is slightly wrong and made wide strings variable length, UTF-16 to be precise. But usually they're UTF-32.

The problem with multi-byte strings is that there are many different encodings, and UTF-8 is only one of them. Windows does not in fact natively support UTF-8 encoding; MessageBoxA(), for instance, can use any encoding but UTF-8. The one exception is MultiByteToWideChar(CP_UTF8, ...), which is what you'd need here.
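For illustration, a rough sketch (my addition, not part of MSalters' answer) of that MultiByteToWideChar(CP_UTF8, ...) conversion; the helper name utf8_to_wide is just an illustrative choice and error handling is kept minimal:

#include <windows.h>
#include <string>

std::wstring utf8_to_wide(const char* utf8)
{
    // First call: ask for the required size in wchar_t units. Passing -1 as
    // the input length means "the string is zero-terminated, include the
    // terminator in the count".
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, nullptr, 0);
    if (len <= 0) { return std::wstring(); }

    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &wide[0], len);
    wide.resize(len - 1);   // drop the terminator the API wrote into the buffer
    return wide;
}

With that in place, something like std::wcout << utf8_to_wide(str) would cover the printUTF8 case from the question, given the console setup from the other answer.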

MSalters
  • 173,980
  • 10
  • 155
  • 350
  • re "VC++ is slightly wrong and made wide strings variable length", that doesn't make sense to me. What did you mean to write? Are you perhaps referring to C99/C++11 rules, I guess resulting from politics / fanboyism, which came long after Windows and Visual C++? – Cheers and hth. - Alf Jun 19 '14 at 13:32
  • Narrow strings can be multi-byte (e.g. `mbclen` can be less than `strlen`). Wide strings can't be, each character should fit in one `wchar_t` and there's only `wcslen`. Windows of course defines `WCHAR`, which historically mapped to `unsigned short`. Mapping `WCHAR` to `wchar_t` is where things get complicated. – MSalters Jun 19 '14 at 15:10
  • I mostly agree with that, but it's misleading in two ways. First, the notion that conceptual character (displayed as a single graphic) is necessarily representable as a single 32-bit encoding value. That's simply not so in general. I.e., the C++ standard library is botched also with UTF-32: it's just a matter of degree, and for the politics, a matter of what one can make the ignorant masses believe. The second way it's a bit misleading is that it implies that Windows WCHAR was mapped to wchar_t at some point. I can't remember that it's not been so, ever. I.e., C99 is at fault. – Cheers and hth. - Alf Jun 19 '14 at 15:24
  • **UPDATE**: I find that the wording I thought was introduced in C99, was there already in C90. So, Microsoft is at fault, not C99 and Unix-land politics, as I maintained. Mea culpa! – Cheers and hth. - Alf Jun 19 '14 at 15:40
  • The first version of Windows NT was released in 1993; at that time, [16 bits was all you needed](http://www.unicode.org/faq/utf_bom.html) to encode the entire Unicode character set. The character set was extended in Unicode 2.0, introduced in 1996. By then I expect it was too late for Microsoft to change their definition of `wchar_t` and the best they could do was change the encoding from UCS-2 to UTF-16 ([as of Windows 2000](http://en.wikipedia.org/wiki/UCS-2#Use_in_major_operating_systems_and_environments)). – Harry Johnston Jun 20 '14 at 00:33

Re your actual question

what is [the L prefix] and how to add it to dynamic strings?

This is very different from the title of the question at the time I’m writing this, namely “How can I make dynamic strings to work with UTF-8 in console?”

In short, UTF-8 is an encoding of Unicode where the basic encoding unit is 8 bits, commonly called a byte (more precisely it's an octet), while the L prefix forms a wide character or string literal, where the encoding unit typically is 16 or 32 bits – in Windows it’s 16 bits, as in original Unicode.

A wide character or string literal is based on the wchar_t type instead of char.

In Windows a wide string is encoded as UTF-16. The most common sixty thousand or so Unicode characters are represented with single wchar_t values, but some seldom used Chinese ideograms etc. require two successive wchar_t values, called a surrogate pair.
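As a small illustration of my own (not part of the original answer), a character outside the Basic Multilingual Plane takes two wchar_t units on Windows:

#include <iostream>

int main()
{
    // U+1D11E MUSICAL SYMBOL G CLEF: a single Unicode character.
    const wchar_t clef[] = L"\U0001D11E";
    // On Windows (16-bit wchar_t) this prints 2, because the character is
    // stored as the surrogate pair 0xD834 0xDD1E; with 32-bit wchar_t it prints 1.
    std::wcout << sizeof clef / sizeof clef[0] - 1 << std::endl;
}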

The use of a 16-bit encoding unit in Windows was established around 1992. I am not sure exactly when UTF-16 was adopted (as an extension of the then-current UCS-2 encoding); it was just a bit later. So this was established long before C99 required that all characters of the wide character set be representable as single wchar_t values. That requirement appears to have been a pure political maneuver, ensuring that no Windows C compiler could be formally conforming, in effect turning a general ISO programming language standard into one that applied only to Unix-land. Unfortunately, since C++11 was based on C99 we now have that also in C++11, ensuring that no Windows C++ compiler can be fully conforming. Pure idiocy, if you ask me.

Errata, re deleted text above: according to Wikipedia’s article about it the wording about a single wchar_t being sufficient for any character in the “extended character set” was there already in C90. Which makes the incompatibility between Windows and the C and C++ standards the fault of Microsoft, not the fault of the C committee. It still appears to be political and fairly idiotic, but (enlightened) with others to blame than I maintained at first…


One way to work with wide dynamic strings is to use std::wstring, from the <string> header.
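For example (a minimal sketch of my own, with the console setup left aside; see the full example further down):

#include <string>
#include <iostream>

int main()
{
    std::wstring phrase = L"Přetečení zásobníku";   // starts from a wide literal
    phrase += L" is Stack overflow in Czech.";      // grows at run time
    std::wcout << phrase << std::endl;
}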

With Visual C++ you can use a wmain function instead of standard main, as an easy way to get wide command line arguments.

wmain is also supported by MinGW64 (IIRC) g++, although not yet by ordinary MinGW g++ as of g++ 4.8.something. It is, however, easy to implement in terms of the Windows API, as sketched below, unless you require strictly standard-conforming code that provides the special main function features, such as the ability to declare it with or without arguments; but hey, let's be practical about things.
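A rough sketch of such a wrapper (my own illustration, not something g++ ships with); CommandLineToArgvW is declared in <shellapi.h> and needs linking against Shell32:

#include <windows.h>
#include <shellapi.h>       // CommandLineToArgvW

int wmain(int argc, wchar_t* argv[]);   // the real program entry point

int main()
{
    // Ask Windows for the wide command line and split it into arguments.
    int argc = 0;
    wchar_t** argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    int const result = wmain(argc, argv);
    LocalFree(argv);
    return result;
}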


Example that compiles fine with both Visual C++ 12.0 and g++ 4.8.2:

// Source encoding: UTF-8 with BOM.

#include <stdio.h>      // _fileno
#include <io.h>         // _setmode
#include <fcntl.h>      // _O_WTEXT

#include <iostream>     // std::wcout, std::endl
#include <string>       // std::wstring
using namespace std;

auto main()
    -> int
{
    _setmode( _fileno( stdin ), _O_WTEXT );
    _setmode( _fileno( stdout ), _O_WTEXT );

    wcout << L"Hi, what’s your name? ";
    wstring username;
    getline( wcin, username );
    wcout << L"Welcome to Windows C++, " << username << "!" << endl;
}

Note that with Windows ANSI source this won’t compile with g++ unless you specify the source encoding with the appropriate compiler option.

Cheers and hth. - Alf
  • 142,714
  • 15
  • 209
  • 331