I'm working with strings in C++ and have a question about how Norwegian characters are treated.

If I run the following code:

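#include <iostream>
#include <string>

using namespace std;

int main()
{
    // Print each byte of the string next to its numeric value
    string norwegian = "BLÅBÆRSYLTETØY";
    for (auto &c : norwegian)
        cout << c << " => " << static_cast<int>(c) << endl;

    return 0;
}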

the output at cmd becomes:

B => 66
L => 76
┼ => -59
B => 66
ã => -58
R => 82
S => 83
Y => 89
L => 76
T => 84
E => 69
T => 84
Ï => -40
Y => 89

Notice that the three Norwegian characters are not printed correctly, and that their numeric values are negative.

Is there any way to treat the string so that it uses the correct character map?

EDIT

The solution is to change the console code page from the default (850) to UTF-7 (65000), which can be done by adding this before the code that does the string handling:

system("chcp 65000");
Frode Lillerud
  • Let me guess: you're using the windows console with an ANSI codepage. – jaggedSpire Jan 23 '17 at 20:47
  • Trying to print unicode characters to the windows console using C++ is...complicated unless you're using the native wide encoded input and output streams `wcin` and `wcout`. Part of the problem is that the console will display the output characters according to the active codepage, instead of interpreting them as, say, UTF-8 encoded strings as one might expect. There *is* a UTF-8 codepage in windows (65001), but whether you want to actually switch codepages just for your application to run with proper output should be something you investigate. – jaggedSpire Jan 23 '17 at 20:55
  • Incidentally, you want to stick string literals containing unicode characters in a [unicode string literal](http://en.cppreference.com/w/cpp/language/string_literal): prepend the quoted text with u8 for utf-8 data stored as characters, u for utf-16 data stored as `char16_t`, U for utf-32 data stored as `char32_t`, and L for native wide encoded data stored in `wchar_t`. – jaggedSpire Jan 23 '17 at 21:00
  • Yes, you are right, I'm on Windows, and the default codepage is 850. I tried changing to wcout, and using u8 for the string literal, but it didn't really help. Using any of the other prefixes (u, U and L) gave compiler errors. And yes, it'll be too much work changing the codepage for this app, so I might just leave this be. – Frode Lillerud Jan 23 '17 at 21:34
  • You'd want to use L with wcout, but yeah, I checked and it *still* doesn't print the characters correctly on the console. (You probably got a compiler error because you need to store a wide string literal in a `wstring`, a utf-16 string literal in a `u16string` and a utf-32 string in a `u32string`.) Amusingly/infuriatingly, the latter two don't have a corresponding input/output mode--you have to manually convert to either narrow or native wide encoding with the stuff in [`<codecvt>`](http://en.cppreference.com/w/cpp/header/codecvt). – jaggedSpire Jan 23 '17 at 21:36
  • Actually, changing the codepage wasn't hard at all! I just added this before the loop, and it worked: system("chcp 65000"); – Frode Lillerud Jan 23 '17 at 21:41
  • That's utf-7, isn't it? How well does it work with utf-8 (65001)? – jaggedSpire Jan 23 '17 at 21:44
  • ("chcp 65001") just gives blocks with question marks. I tried several of the different Unicode string literal types. It seems that only ("chcp 65000") does what I want. – Frode Lillerud Jan 23 '17 at 21:51
  • Hm. Thanks for checking and telling me. I'll add it to my list of things to tell the people trying to print non-ascii characters to the windows console. :) – jaggedSpire Jan 23 '17 at 21:54
  • Thanks for guiding me down the right path :) – Frode Lillerud Jan 23 '17 at 22:01
  • [just found this](http://stackoverflow.com/questions/2849010/output-unicode-to-console-using-c) – jaggedSpire Jan 23 '17 at 22:03
  • Are you sure the code as posted has worked? There is nothing in this program that can possibly output a Unicode character that is not ASCII. – n. m. could be an AI Jan 23 '17 at 22:14
  • Absolutely, it's copied from the VS solution. What do you mean is wrong with it? – Frode Lillerud Jan 23 '17 at 22:16
  • Norwegian characters occupy more than one byte (`char` data type in C) in UTF-8, yet you are printing individual bytes. It is very much likely that you are not working with UTF-8 or UTF-7 or anything Unicode at all. – n. m. could be an AI Jan 24 '17 at 07:42
  • What can I say, other than "it works"... – Frode Lillerud Jan 24 '17 at 20:44

0 Answers