11

I need to print some Unicode characters on the Linux terminal using iostream. Strange things happen though. When I write:

cout << "\u2780";

I get: ➀, which is almost exactly what I want. However, if I write:

cout << '\u2780';

I get: 14851712.

The problem is, I don't know the exact character to be printed at compile time. Therefore I'd like to do something like:

int x;
// Some calculations...
cout << (char)('\u2780' + x);

Which prints garbage. Using wcout or wchar_t instead doesn't work either. How do I get correct printing?

From what I found on the Internet, it seems it may matter that I am using the GCC 4.7.2 compiler (executable g++) straight from the Debian 7 (Wheezy) repository.

Peter Mortensen
Sventimir
  • Are you using wchar_t with the `L` prefix? Post your full code if possible, or an [SSCCE](http://sscce.org). – pinkpanther Jun 05 '13 at 16:17
  • If you do not want to mess with Unicode encodings, you could use a table to map strings to possible values of `x` instead of adding it. – dyp Jun 05 '13 at 16:27
  • Possible duplicate of [How to print Unicode character in C++?](http://stackoverflow.com/questions/12015571/how-to-print-unicode-character-in-c) – Adrian McCarthy Jan 27 '16 at 19:02

4 Answers

8

The Unicode character \u2780 is outside the range of the char datatype. You should have received a compiler warning telling you about it (at least my g++ 4.7.3 gives one):

test.cpp:6:13: warning: multi-character character constant [-Wmultichar]

If you want to work with characters like U+2780 as single units, you'll have to use the wide-character datatype wchar_t, or, if you are lucky enough to be able to work with C++11, char32_t or char16_t. Note that one 16-bit unit is not enough to represent the full range of Unicode characters.

If that's not working for you, it's probably because the default "C" locale doesn't support non-ASCII output. To fix that, call setlocale at the start of the program; that way you can output the full range of characters supported by the user's locale (which may or may not include all of the characters you use):

#include <clocale>
#include <iostream>

using namespace std;

int main() {
    setlocale(LC_ALL, "");  // switch from the default "C" locale to the user's locale
    wcout << L'\u2780';
    return 0;
}
Peter Mortensen
Joni
  • Which of course might have the same problem with other characters (from the SMP) if `sizeof(wchar_t) < 4`. I'd suggest using `char16_t` or `char32_t` btw. – dyp Jun 05 '13 at 16:18
  • 2
    In addition to the encoding prefix `L`, there's `u8` for UTF-8 encoding, `u` for `char16_t`, and `U` for `char32_t`. – Appleshell Jun 05 '13 at 16:24
  • `setlocale` when passing a `""` for the locale name sets the user's preferred locale, that is not necessarily a Unicode locale. – dyp Jun 05 '13 at 16:25
  • Thanks @DyP, I've added the note on the new character datatypes. – Joni Jun 05 '13 at 16:30
  • Though when using g++ on Linux, `wchar_t` is in fact a 32-bit Unicode code point. Nice to know if you care more about getting it working on Linux than being portable. – aschepler Jun 05 '13 at 16:34
  • 1
    @Sventimir IIRC they left out Unicode support for streams in C++11; there's no support for `wcout <<` with a `char16_t` or `char32_t`. You'll have to either do a custom conversion from those to the expected encoding of `wchar_t` or use unformatted output. – dyp Jun 05 '13 at 16:42
  • Thank you all. Unfortunately nothing worked for me. `char16_t` and `char32_t` print the decimal representation of the character on both `cout` and `wcout`. Setting the `LC_ALL` locale does not work either. It seems I'll have to think of mapping int values to strings as DyP suggested instead. – Sventimir Jun 05 '13 at 16:47
  • @Sventimir, what's your locale (check with the `locale` command)? For me, with locale en_US.UTF-8, the test program above outputs the expected "➀" – Joni Jun 05 '13 at 16:54
  • Mine is pl_PL.UTF-8. That should not matter as long as it is UTF-8 I suppose? – Sventimir Jun 05 '13 at 17:08
  • If it's correctly installed and UTF-8 it shouldn't matter. If you can see Unicode output from other programs, such as `ls`, it should work. – Joni Jun 05 '13 at 17:12
  • It seems I do see proper output (such as national characters) from other programs (such as `ls`), but not from mine. Neither in the Konsole nor in Eclipse. – Sventimir Jun 05 '13 at 17:32
  • you might need `std::ios_base::sync_with_stdio(false);` if you want to use both `cout` and `wcout` in this case. – jfs May 21 '16 at 11:04
4

When you write

cout << "\u2780";

The compiler converts \u2780 into the appropriate encoding of that character in the execution character set. That's probably UTF-8, and so the string ends up having four bytes (three for the character, one for the null terminator).
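
To see what the compiler actually stored, here is a minimal sketch (assuming a UTF-8 execution character set, which is the g++ default on Linux) that dumps the bytes of the literal:

#include <cstdio>
#include <cstring>

int main() {
    const char *s = "\u2780";   // encoded by the compiler; E2 9E 80 under UTF-8
    std::printf("length: %zu bytes\n", std::strlen(s));   // prints 3
    for (const char *p = s; *p != '\0'; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    std::printf("\n");
}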

If you want to generate the character at run time then you need some way to do the same conversion to UTF-8 at run time that the compiler is doing at compile time.


C++11 provides a handy wstring_convert template and codecvt facets that can do this, however libstdc++, the standard library implementation that comes with GCC, has not yet gotten around to implementing them (as of GCC 4.8.0 (2013-03-22)). The following shows how to use these features, but you'll need to either use a different standard library implementation or wait for libstdc++ to implement them.

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main() {
  char32_t base = U'\u2780';

  // Convert the UTF-32 code point (plus a run-time offset) to a UTF-8 std::string.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  std::cout << convert.to_bytes(base + 5) << '\n';
}

You can also use any other method of producing UTF-8 you have available. For example, iconv, ICU, and manual use of pre-C++11 codecvt_byname facets would all work. (I don't show examples of these, because that code would be more involved than the simple code permitted by wstring_convert.)
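
For code points in this range, though, a hand-rolled conversion is also only a few lines. A minimal sketch (the to_utf8 helper is a made-up name, not a library function; a UTF-8 terminal is assumed, and code points above U+FFFF are not handled):

#include <iostream>
#include <string>

// Encode a single code point up to U+FFFF as UTF-8 (enough for U+2780..U+2789).
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    int x = 5;                                   // computed at run time
    std::cout << to_utf8(U'\u2780' + x) << '\n'; // prints ➅ (U+2785) on a UTF-8 terminal
}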


An alternative that would work for a small number of characters would be to create an array of strings using literals.

char const *special_character[] = { "\u2780", "\u2781", "\u2782",
  "\u2783", "\u2784", "\u2785", "\u2786", "\u2787", "\u2788", "\u2789" };

// i is the offset computed at run time, in the range 0 through 9.
std::cout << special_character[i] << '\n';
Peter Mortensen
bames53
2

The program prints an integer because of C++11 §2.14.3/1:

A multicharacter literal, or an ordinary character literal containing a single c-char not representable in the execution character set, is conditionally-supported, has type int, and has an implementation-defined value.

The execution character set is what char can represent, i.e., ASCII.

You got 14851712, or in hexadecimal E29E80, which is the UTF-8 representation of U+2780 (DINGBAT CIRCLED SANS-SERIF DIGIT ONE). Putting UTF-8, a multibyte encoding, into an int is insane and stupid, but that's what you get from a "conditionally supported, implementation-defined" feature.

To get a UTF-32 value, use U'\u2780'. The first U specifies the char32_t type and UTF-32 encoding (i.e. up to 31 bits but no surrogate pairs). The second \u specifies a universal-character-name containing the code point. To get a value supposedly compatible with wcout, use L'\u2780', but that doesn't necessarily use a Unicode runtime value nor get you more than two bytes of storage.
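
A quick way to see the difference between the literal forms (the exact multicharacter value is implementation-defined; the values in the comments are what g++ on Linux produces, matching the 14851712 from the question):

#include <cstdio>

int main() {
    char32_t c32 = U'\u2780';   // UTF-32 code point: 0x2780
    wchar_t  cw  = L'\u2780';   // wide character: also 0x2780 with g++ on Linux (32-bit wchar_t)
    int      ci  = '\u2780';    // conditionally-supported: 0xE29E80 (14851712) with g++, plus a -Wmultichar warning
    std::printf("%X %X %X\n",
                static_cast<unsigned>(c32),
                static_cast<unsigned>(cw),
                static_cast<unsigned>(ci));
}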

As for reliably manipulating and printing the Unicode code point, as other answers have noted, the C++ standard hasn't quite gotten there yet. Joni's answer is the best way, yet it still assumes that the compiler and the user's environment are using the same locale, which often isn't true.

You can also specify UTF-8 strings in the source using u8"\u2780" and force the runtime environment to UTF-8 using something like std::locale::global( std::locale( "en_US.UTF-8" ) );. But that still has rough edges. Joni suggests using the C interface std::setlocale from <clocale> instead of the C++ interface std::locale::global from <locale>, which is a workaround for the C++ interface being broken in GCC on OS X and perhaps other platforms. The issues are platform-sensitive enough that your Linux distribution might well have put a patch into their own GCC package.
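
A minimal sketch of that approach (the locale has to be installed on the system, otherwise the std::locale constructor throws):

#include <iostream>
#include <locale>

int main() {
    try {
        std::locale::global(std::locale("en_US.UTF-8"));  // force a UTF-8 locale for the C++ (and C) library
    } catch (const std::runtime_error&) {
        // Locale not installed; keep whatever the environment provides.
    }
    std::cout << u8"\u2780" << '\n';  // u8 literal: the bytes are UTF-8 regardless of the execution character set
}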

Peter Mortensen
Potatoswatter
  • Either you or I probably missed something, because the compiler now complains: "U was not declared in this scope". – Sventimir Jun 16 '13 at 07:11
  • @Sventimir Apparently it's not supported in GCC 4.7.2, but it's part of the C++11 standard. Just go with `L'xxx'`; in Linux it should do essentially the same thing. – Potatoswatter Jun 16 '13 at 07:30
  • Adding C++11 support with `gcc --std=c++11` does not work either. It now compiles, but prints the decimal value of the char (10112), not the char itself. – Sventimir Jun 16 '13 at 17:20
0

On Linux, I have been able to print any Unicode character directly, in the most naive way:

std::cout << "ΐ, Α, Β, Γ, Δ, Θ, Λ, Ξ, ... ±, ... etc.";
Peter Mortensen
quanta
  • How does that answer the question? It doesn't even include [U+2780](https://www.utf8-chartable.de/unicode-utf8-table.pl?start=10104&number=128). – Peter Mortensen May 16 '23 at 13:53
  • [A similar unexplained answer](https://stackoverflow.com/questions/12015571/how-to-print-unicode-character-in-c/41546489#41546489). – Peter Mortensen May 16 '23 at 14:00