
I'm coding in C++ on Linux (Ubuntu) and trying to print a string that contains some Latin characters.

Trying to debug, I have something like the following:

std::wstring foo = L"ÆØÅ";
std::wcout << foo;
for(int i = 0; i < foo.length(); ++i) {
    std::wcout << std::hex << (int)foo[i] << " ";
    std::wcout << (char)foo[i];
}

Characteristics of the output I get:

  • The first print shows: ???
  • The loop prints the hex for the three characters as c6 d8 c5
  • When foo[i] is cast to char (or wchar_t), nothing is printed

The environment variable $LANG is set to the default, en_US.UTF-8.

Jorengarenar
George Hernando
  • What part is it that you have a question about? – Ted Lyngmo Jul 16 '20 at 01:44
  • [This answer](https://stackoverflow.com/a/402918/10247460) may shed some light on your problem – Jorengarenar Jul 16 '20 at 01:44
  • @Ted Lyngmo The Latin characters aren't printing properly to the console. They print as ? or are not printed at all. – George Hernando Jul 16 '20 at 01:50
  • The name of your variable is misleading, even though it's named `u8` it is *not* UTF-8. Casting those individual characters will not give you anything valid, you must do a full conversion. – Mark Ransom Jul 16 '20 at 01:52
  • @Jorengarenar A long article. But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed. – George Hernando Jul 16 '20 at 01:54
  • https://stackoverflow.com/q/148403/5987 may help. – Mark Ransom Jul 16 '20 at 01:55
  • Since you have *something* printed, I guess you are using `libstdc++`. It is possible to make it work if you set the locale properly. However `libc++` won't output any `wchar_t` that is not an ASCII character and I'm not aware of any plans to fix that. `wchar_t` is basically a dead end. Don't touch it with a six foot pole. – n. m. could be an AI Jul 16 '20 at 13:58

1 Answer


In the conclusion of the answer I linked (which I still recommend reading) we can find:

When I should use std::wstring over std::string?

On Linux? Almost never, unless you use a toolkit/framework.

Short explanation why:

First of all, Linux natively uses UTF-8 and is consistent about it (in contrast to, e.g., Windows, where files have one encoding and cmd.exe another).

Now let's have a look at this simple program:

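#include <cstddef>
#include <iostream>
#include <string>

int main()
{
    std::string  foo =  "ψA"; // character 'A' is just a control sample
    std::wstring bar = L"ψA"; // same text, wide version

    // One value per char: the UTF-8 code units (negative where char is signed)
    for (std::size_t i = 0; i < foo.length(); ++i) {
        std::cout << static_cast<int>(foo[i]) << " ";
    }
    std::cout << std::endl;

    // One value per wchar_t; the values are plain ints, so std::cout can print
    // them and we avoid mixing std::cout and std::wcout on the same stream
    for (std::size_t i = 0; i < bar.length(); ++i) {
        std::cout << static_cast<int>(bar[i]) << " ";
    }
    std::cout << std::endl;

    return 0;
}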

The output is:

-49 -120 65 
968 65 

What does this tell us? 65 is the ASCII code of the character 'A', which means that -49 -120 and 968 both correspond to 'ψ'.

As char, the character 'ψ' actually takes two chars. As wchar_t, it's just one wchar_t.
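To see how the two char values and the single wchar_t value relate, here is a minimal sketch that decodes a two-byte UTF-8 sequence by hand. The name decode_two_byte is made up for illustration, and it handles only the two-byte form (110xxxxx 10xxxxxx), not UTF-8 in general.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: decodes exactly one two-byte UTF-8 sequence
// (110xxxxx 10xxxxxx) into its Unicode code point.
char32_t decode_two_byte(const std::string& s)
{
    assert(s.size() >= 2);
    // Go through unsigned char so that -49 and -120 become 0xCF and 0x88.
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    const unsigned char cont = static_cast<unsigned char>(s[1]);
    assert((lead & 0xE0) == 0xC0); // lead byte must look like 110xxxxx
    assert((cont & 0xC0) == 0x80); // continuation byte must look like 10xxxxxx
    // Five payload bits from the lead byte, six from the continuation byte.
    return (static_cast<char32_t>(lead & 0x1F) << 6) | (cont & 0x3F);
}
```

Calling decode_two_byte("\xCF\x88") gives 968 (0x3C8), the same value that the single wchar_t holds.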

Let's also check sizes of those types:

std::cout << "sizeof(char)    : " << sizeof(char)    << std::endl;
std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << std::endl;

Output:

sizeof(char)    : 1
sizeof(wchar_t) : 4

A byte on my machine has the standard 8 bits, so char is 1 byte (8 bits), while wchar_t is 4 bytes (32 bits).

UTF-8 operates on, nomen omen, code units of 8 bits. There is also a fixed-length encoding, UTF-32, which uses exactly 32 bits (4 bytes) per Unicode code point, but it's UTF-8 that Linux uses.

Ergo, the terminal expects to get those two bytes (negative when read as signed char) to print the character 'ψ', not one value far above the ASCII table (whose codes are defined only up to 127, half of the possible values of char).

That's why std::cout << char(-49) << char(-120); will also print ψ.
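The negative numbers are only how a signed char reports those bytes. As a small sketch (hex_bytes is a hypothetical helper, not something from the linked answer), the same values can be shown as the raw bytes the terminal receives:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: renders each byte of a string as two hex digits.
std::string hex_bytes(const std::string& s)
{
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0) out << ' ';
        // Convert via unsigned char first; where char is signed, a plain
        // cast to int would sign-extend and print ffffffcf instead of cf.
        out << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(s[i]));
    }
    return out.str();
}
```

For the string made of char(-49) and char(-120) this returns "cf 88", which is the UTF-8 encoding of 'ψ'.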


But it shows the const char[] as printing correctly. But when I typecast to (char), nothing is printed.

The character was already encoded differently; the values stored in there are different, so a simple cast won't be enough to convert them.

And as I've shown, the size of char is 1 byte and that of wchar_t is 4 bytes. You can safely cast upward, not downward.
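Since a cast cannot do it, the values have to be re-encoded. Below is a minimal sketch of one way to do that by hand. The names append_utf8 and to_utf8 are made up for illustration, and the code assumes that each wchar_t holds one whole Unicode code point, which is the case on Linux with glibc but not on Windows. It does not reject surrogate values.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends one Unicode code point to out as UTF-8.
void append_utf8(std::string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF); // anything larger is outside the Unicode range
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Assumption: wchar_t is UTF-32 (one code point per element).
std::string to_utf8(const std::wstring& in)
{
    std::string out;
    for (wchar_t wc : in) {
        append_utf8(out, static_cast<char32_t>(wc));
    }
    return out;
}
```

With this, std::cout << to_utf8(L"ÆØÅ") writes the six bytes c3 86 c3 98 c3 85 rather than the three values c6 d8 c5 from the question, and a UTF-8 terminal can display them.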

Jorengarenar