82

I am trying to print the Russian character "ф" (U+0444 CYRILLIC SMALL LETTER EF), whose code point is 1092 in decimal. How can I print this character in C++? I would have thought something along the lines of the following would work, yet...

int main (){
   wchar_t f = '1060';
   cout << f << endl;
}
Peter Mortensen
James Raitsev
  • 2
    Note that the problem is two-fold (at least when it comes to a valid C++ program): expressing the character in code, and correctly passing it to `std::cout`. (And even when those two steps are done correctly it's a different matter altogether of correctly displaying the character inside whatever `std::cout` is connected to.) – Luc Danton Aug 18 '12 at 04:46
  • Does this answer your question? [Unicode encoding for string literals in C++11](https://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11) – M.J. Rayburn Jun 24 '21 at 02:33

10 Answers

78

To represent the character you can use Universal Character Names (UCNs). The character 'ф' has the Unicode value U+0444 and so in C++ you could write it '\u0444' or '\U00000444'. Also if the source code encoding supports this character then you can just write it literally in your source code.

// both of these assume that the character can be represented with
// a single char in the execution encoding
char b = '\u0444';
char a = 'ф'; // this line additionally assumes that the source character encoding supports this character

Printing such characters out depends on what you're printing to. If you're printing to a Unix terminal emulator, the terminal emulator is using an encoding that supports this character, and that encoding matches the compiler's execution encoding, then you can do the following:

#include <iostream>

int main() {
    std::cout << "Hello, ф or \u0444!\n";
}

This program does not require that 'ф' can be represented in a single char. On macOS and nearly any modern Linux install this will work just fine, because the source, execution, and console encodings will all be UTF-8 (which supports all Unicode characters).
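To make the "more than one char" point concrete: in UTF-8, 'ф' occupies two bytes, 0xD1 0x84. The following is a minimal sketch for inspecting the bytes of a narrow string; the helper name hex_bytes is made up for illustration and is not part of any library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of a narrow string as
// space-separated lowercase hex, e.g. "d1 84".
std::string hex_bytes(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

If the execution encoding is UTF-8, hex_bytes("\u0444") should give "d1 84"; with a different execution encoding, such as ISO-8859-5, the result would differ.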

Things are harder with Windows and there are different possibilities with different tradeoffs.

Probably the best option, if you don't need portable code (you'll be using wchar_t, which should really be avoided on every other platform), is to set the mode of the output file handle so that it accepts only UTF-16 data.

#include <iostream>
#include <io.h>
#include <fcntl.h>

int main() {
    _setmode(_fileno(stdout), _O_U16TEXT);
    std::wcout << L"Hello, \u0444!\n";
}

Portable code is more difficult.

bames53
  • 6
    ? I'm pretty sure '\u0444' won't fit into a char unless the compiler has promoted the char to an int, but if you want that behavior, you should use an int. – Edward Falk Sep 04 '16 at 20:20
  • 1
    @EdwardFalk \u0444 will fit in an 8 bit `char` if the execution charset is, for example, ISO-8859-5. Specifically it will be the byte 0xE4. Note that I'm not suggesting that using such an execution charset is a good practice, I'm simply describing how C++ works. – bames53 Sep 05 '16 at 03:13
  • 1
    Ahhh, you're saying the compiler will recognize \u0444 as a unicode character, and convert it to the prevailing character set, and the result will fit in a byte? I didn't know it would do that. – Edward Falk Sep 05 '16 at 16:03
  • 1
    Yes. This is why using `\u` is different from using `\x`. – bames53 Mar 06 '17 at 02:29
  • doesn't work on my lubuntu 16 laptop with terminator terminal and g++ 5.4.0, using a std::string worked though – Austin_Anderson Oct 15 '17 at 18:22
18

When compiling with -std=c++11, one can simply write:

  const char *s  = u8"\u0444";
  cout << s << endl;
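One caveat worth knowing: since C++20, a u8 string literal has element type char8_t rather than char, so assigning it to a const char * no longer compiles in that mode. A minimal sketch that works before and after that change (the function name ef_utf8 is made up for illustration):

```cpp
#include <string>

// Returns the UTF-8 bytes of U+0444 as an ordinary std::string.
// The literal is const char[] before C++20 and const char8_t[] from
// C++20 on; the cast makes both cases yield plain chars.
std::string ef_utf8() {
    const auto* p = u8"\u0444";
    return std::string(reinterpret_cast<const char*>(p));
}
```

The result can then be printed with std::cout << ef_utf8() as in the snippet above, assuming the terminal expects UTF-8.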
James Raitsev
  • 4
    Let me recommend [Boost.Nowide](http://cppcms.com/files/nowide/html/) for printing UTF-8 strings to terminal in a portable way, so the above code will be almost unchanged. – Yakov Galka Aug 30 '12 at 10:47
  • 2
    @ybungalobill, your comment deserves an answer on its own. Would you mind creating one? – Jorge Leitao Jan 06 '15 at 13:24
  • 1
    Just for my note: `\uXXXX` and `\UXXXXXXXX` are called *universal-character-name*. A string literal of the form `u8"..."` is *UTF-8 string literal*. Both are specified in the standard. – ynn Dec 27 '19 at 11:50
12

Ultimately, this is completely platform-dependent. Unicode support is, unfortunately, very poor in Standard C++. With GCC on a typical Linux system, you would use a narrow string, since the default encoding there is UTF-8. On Windows, you would use a wide string and output it to wcout.

// GCC
std::cout << "ф";
// Windows
std::wcout << L"ф";
Puppy
  • They should have Unicode escapes. I'm not familiar with the notation, though. – Puppy Aug 18 '12 at 03:36
  • 1
    IIRC, Unicode escapes are `\uXXXX` where the `XXXX` is for **hex** digits. Unfortunately, this leaves all the characters past U+FFFF out. – Mike DeSimone Aug 18 '12 at 03:39
  • Looking at http://jrgraphix.net/r/Unicode/0400-04FF, how should assignment play out `wchar_t x = '\u0400';` for instance, does not work – James Raitsev Aug 18 '12 at 03:40
  • 1
    @Mike: If you want past FFFF, you can do so by generating a UTF-16 surrogate pair yourself using two instances of `\u`, at least on windows. – Billy ONeal Aug 18 '12 at 03:41
  • 1
    The OP wants to specify the character in decimal, not hex, so string escapes are kind of useless. – Mark Ransom Aug 18 '12 at 03:42
  • I'll take hex. Really, i am just looking for a way to print any character in unicode and have it displayed as it should. Using cyrillic just as an example here – James Raitsev Aug 18 '12 at 03:43
  • 9
    @BillyONeal You do not use surrogate code points in C++ (in fact surrogate code points are completely prohibited). You use the format `\UXXXXXXXX`. – bames53 Aug 18 '12 at 03:46
  • 2
    GCC is not bound to use UTF-8, and is available for Windows. `std::wcout` is also an option outside of Windows. – Luc Danton Aug 18 '12 at 04:48
  • 2
    @Jam `'\u0400'` is a **narrow-character literal**. You seem to assume that `\u0400` exists in the execution character set. According to N3242 [lex.ccon]/5: "A universal-character-name is translated to the encoding, in the appropriate execution character set, of the character named. If there is no such encoding, the universal-character-name is translated to an implementation defined encoding." – curiousguy Aug 18 '12 at 05:01
9

This code works in Linux (C++11, Geany, and GCC 7.4 (g++, 2018-12-06)):

#include <iostream>
#include <string>

using namespace std;

int utf8_to_unicode(string utf8_code);
string unicode_to_utf8(int unicode);


int main()
{
    cout << unicode_to_utf8(36) << '\t';
    cout << unicode_to_utf8(162) << '\t';
    cout << unicode_to_utf8(8364) << '\t';
    cout << unicode_to_utf8(128578) << endl;

    cout << unicode_to_utf8(0x24) << '\t';
    cout << unicode_to_utf8(0xa2) << '\t';
    cout << unicode_to_utf8(0x20ac) << '\t';
    cout << unicode_to_utf8(0x1f642) << endl;

    cout << utf8_to_unicode("$") << '\t';
    cout << utf8_to_unicode("¢") << '\t';
    cout << utf8_to_unicode("€") << '\t';
    cout << utf8_to_unicode("🙂") << endl;

    cout << utf8_to_unicode("\x24") << '\t';
    cout << utf8_to_unicode("\xc2\xa2") << '\t';
    cout << utf8_to_unicode("\xe2\x82\xac") << '\t';
    cout << utf8_to_unicode("\xf0\x9f\x99\x82") << endl;

    return 0;
}


int utf8_to_unicode(string utf8_code)
{
    unsigned utf8_size = utf8_code.length();
    int unicode = 0;

    for (unsigned p=0; p<utf8_size; ++p)
    {
        int bit_count = (p? 6: 8 - utf8_size - (utf8_size == 1? 0: 1)),
            shift = (p < utf8_size - 1? (6*(utf8_size - p - 1)): 0);

        for (int k=0; k<bit_count; ++k)
            unicode += ((utf8_code[p] & (1 << k)) << shift);
    }

    return unicode;
}


string unicode_to_utf8(int unicode)
{
    string s;

    if (unicode>=0 and unicode <= 0x7f)  // 7F(16) = 127(10)
    {
        s = static_cast<char>(unicode);

        return s;
    }
    else if (unicode <= 0x7ff)  // 7FF(16) = 2047(10)
    {
        unsigned char c1 = 192, c2 = 128;

        for (int k=0; k<11; ++k)
        {
            if (k < 6)
                c2 |= (unicode % 64) & (1 << k);
            else
                c1 |= (unicode >> 6) & (1 << (k - 6));
        }

        s = c1;
        s += c2;

        return s;
    }
    else if (unicode <= 0xffff)  // FFFF(16) = 65535(10)
    {
        unsigned char c1 = 224, c2 = 128, c3 = 128;

        for (int k=0; k<16; ++k)
        {
            if (k < 6)
                c3 |= (unicode % 64) & (1 << k);
            else if (k < 12)
                c2 |= (unicode >> 6) & (1 << (k - 6));
            else
                c1 |= (unicode >> 12) & (1 << (k - 12));
        }

        s = c1;
        s += c2;
        s += c3;

        return s;
    }
    else if (unicode <= 0x1fffff)  // 1FFFFF(16) = 2097151(10)
    {
        unsigned char c1 = 240, c2 = 128, c3 = 128, c4 = 128;

        for (int k=0; k<21; ++k)
        {
            if (k < 6)
                c4 |= (unicode % 64) & (1 << k);
            else if (k < 12)
                c3 |= (unicode >> 6) & (1 << (k - 6));
            else if (k < 18)
                c2 |= (unicode >> 12) & (1 << (k - 12));
            else
                c1 |= (unicode >> 18) & (1 << (k - 18));
        }

        s = c1;
        s += c2;
        s += c3;
        s += c4;

        return s;
    }
    else if (unicode <= 0x3ffffff)  // 3FFFFFF(16) = 67108863(10)
    {
        ;  // Actually, there are no 5-bytes unicodes
    }
    else if (unicode <= 0x7fffffff)  // 7FFFFFFF(16) = 2147483647(10)
    {
        ;  // Actually, there are no 6-bytes unicodes
    }
    else
        ;  // Incorrect unicode (< 0 or > 2147483647)

    return "";
}
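For comparison, the same encoding can be written without the per-bit loops, using shifts and masks directly. This is an alternative sketch, not the code above: the name encode_utf8 is invented here, and unlike the original it caps input at U+10FFFF (the upper limit of Unicode) and rejects surrogate values.

```cpp
#include <string>

// Sketch of a shift-and-mask UTF-8 encoder. Returns an empty string for
// surrogates (U+D800..U+DFFF) and for values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
    std::string s;
    if (cp <= 0x7F) {
        s += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        s += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        s += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return s;          // not characters
        s += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        s += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return s;
}
```

For the character from the question, encode_utf8(0x444) produces the two bytes 0xD1 0x84, which a UTF-8 terminal displays as ф.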


Peter Mortensen
Iro
8

If you use Windows (note, we are using printf(), not cout):

// Save as UTF-8 without a signature (BOM)
#include <stdio.h>
#include <windows.h>

int main (){
    SetConsoleOutputCP(65001);
    printf("ф\n");
}

This variant is not Unicode, but it works. It uses Windows-1251 instead of UTF-8:

// Save as Windows 1251
#include <iostream>
#include <windows.h>

using namespace std;

int main (){
    SetConsoleOutputCP(1251);
    cout << "ф" << endl;
}
Peter Mortensen
vladasimovic
3

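'1060' is a multicharacter literal: four characters packed into an int with an implementation-defined value, so it does not give you code point 1060. Treat the character as a number instead, provided your wide characters match 1:1 with Unicode (check your locale settings). Note that 1060 is U+0424 (capital Ф); the lowercase ф from the question is 1092.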

#include <iostream>

int main() {
    wchar_t f = 1060;  // U+0424 'Ф'; use 1092 (U+0444) for 'ф'
    std::wcout << f << std::endl;
}
Mike DeSimone
  • I thought that was one of the points of iostreams: it would detect the type via overloaded `operator <<` and Do The Right Thing. Not so much, I guess? – Mike DeSimone Aug 18 '12 at 03:38
  • @Jam much of this is system dependent. What OS are you using? – Mark Ransom Aug 18 '12 at 03:43
  • 4
    `'1060'` is a multi-char character literal of type `int`, and is entirely legal under standard C++. Its value is implementation defined though. Most implementations will take the values of the characters and concatenate them to produce a single integral value. These are sometimes used for so-called 'FourCC's. – bames53 Aug 18 '12 at 03:49
  • I'm familiar with FourCC's (the original Mac OS used them everywhere), but every non-Mac compiler I've used emitted at least a warning when it hit a multibyte character constant. I doubt "entirely" legal would get a warning. – Mike DeSimone Aug 18 '12 at 03:51
  • 3
    Perhaps you'd be surprised how many warnings there are for entirely legal code. The C++ standard says "An ordinary character literal that contains more than one c-char is a multicharacter literal. A multicharacter literal has type int and implementation-defined value." [lex.ccon] 2.14.3/1 – bames53 Aug 18 '12 at 04:01
  • 2
    @MikeDeSimone "_every non-Mac compiler I've used emitted at least a warning_" because it is 1) almost never used on purpose on non-Mac systems 2) not a portable construct – curiousguy Aug 18 '12 at 04:53
  • Both GCC and MSVC do the same concatenation of bytes, so it's at least that portable. Actually, I'm not aware of any compiler that _doesn't_ do it. – bames53 Aug 18 '12 at 05:02
  • @curiousguy That was my point. And if a compiler is going to bother emitting a warning, I assume it's there for a reason and I'll avoid it. Note that bit fields are just as implementation-defined, yet generate no warnings. – Mike DeSimone Aug 18 '12 at 12:11
  • @MikeDeSimonek "_Note that bit fields are just as implementation-defined_" No, they are not. – curiousguy Aug 18 '12 at 13:40
  • @bames53 "_Both GCC and MSVC do the same concatenation of bytes, so it's at least that portable._" OK, so I retract my non-portable statement. (But it isn't guaranteed by the standard.) Anyway, I love multi-character literals! – curiousguy Aug 18 '12 at 13:42
  • @curiousguy Last I checked, two things about bit fields were implementation-defined: 1) whether consecutive bit fields were packed from high-order bits to low, or low-to-high, and 2) if the total number of bits was less than the storage type, which end (MSBs or LSBs) would get the padding bits. Also see http://www.linuxforu.com/2012/01/joy-of-programming-understanding-bit-fields-c/ and http://yarchive.net/comp/linux/bitfields.html So if the standard bothered to nail down these two issues, please quote it. – Mike DeSimone Aug 19 '12 at 14:10
  • @MikeDeSimone: It doesn't nail those down. But you can use bitfields entirely portably with standard defined behaviour (and *probably* less memory consumption than if you used just a bunch of ints). That is not true of multicharacter literals. (Note: If you are using bitfields to try to match some externally defined storage format, the standard will not help you - so "don't do that then".) – Martin Bonner supports Monica Dec 08 '16 at 11:52
1

I needed to show the string in the UI as well as save it to an XML configuration file. The format specified above works for strings in C++. To get an XML-compatible string for the special character, replace "\u" with "&#x" and add a ";" at the end.

For example:

C++: "\u0444" → XML : "&#x0444;"
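If the code point is available as a number rather than as source text, the same mapping can be produced programmatically. A minimal sketch, with a helper name (to_xml_char_ref) invented for illustration:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats a code point as an XML numeric character
// reference with at least four uppercase hex digits.
std::string to_xml_char_ref(char32_t cp) {
    std::ostringstream out;
    out << "&#x" << std::uppercase << std::hex << std::setw(4)
        << std::setfill('0') << static_cast<unsigned long>(cp) << ';';
    return out.str();
}
```

For example, to_xml_char_ref(0x444) yields "&#x0444;", matching the mapping shown above. Code points beyond U+FFFF simply use more digits.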

Peter Mortensen
MGR
1

Special thanks to the answer here for more-or-less the same question.

For me, all I needed was setlocale(LC_ALL, "en_US.UTF-8");

Then, I could use even raw wchar_t characters.
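A minimal sketch of that approach follows. It assumes that the en_US.UTF-8 locale is installed, that wchar_t holds Unicode code points (true with glibc), and that the standard library's wcout respects the C locale (which I believe is the case for libstdc++ by default). The names kEf, init_locale, and print_ef_wide are made up for illustration.

```cpp
#include <clocale>
#include <iostream>

// 'ф' as a wide character; 1092 decimal when wchar_t is Unicode.
constexpr wchar_t kEf = L'\u0444';

// Tries the locale named above, then falls back to whatever locale the
// environment specifies. Returns false if neither call succeeds.
bool init_locale() {
    if (std::setlocale(LC_ALL, "en_US.UTF-8")) return true;
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Prints the character through the wide stream after setting the locale.
void print_ef_wide() {
    init_locale();
    std::wcout << kEf << L'\n';
}
```

If the fallback ends up in the plain "C" locale, the wide stream may fail to convert the character or substitute it, so checking the return value of init_locale is worthwhile in real code.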

Andrew
0

In Linux, I can just do:

std::cout << "ф";

I just copy-pasted characters from here, and it worked for at least the random sample that I tried.

quanta
0

Another solution in Linux:

string a = "Ф";
cout << "Ф = \xd0\xa4 = " << hex
     << int(static_cast<unsigned char>(a[0]))
     << int(static_cast<unsigned char>(a[1])) << " (" << a.length() << "B)" << endl;

string b = "√";
cout << "√ = \xe2\x88\x9a = " << hex
     << int(static_cast<unsigned char>(b[0]))
     << int(static_cast<unsigned char>(b[1]))
     << int(static_cast<unsigned char>(b[2])) << " (" << b.length() << "B)" << endl;