
There are some posts on this matter but I wanted to double-check. In Joel Spolsky's article (link) one reads:

In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

My question is: Is what is written above not enough to support Unicode in a C++ app?

My confusion started when I couldn't output simple text like this (in Russian):

wcout << L"логин";

in the console.

Also, I recently saw some code written for an embedded device where one person handles what I think are Unicode-related strings using wchar_t.

Any help greatly appreciated.

pseudonym_127
  • What exactly do you mean by C++? Do you want to be portable, or do you only care about Windows. The quote in the question is old and Windows specific. – David Heffernan Jun 26 '13 at 11:53
  • 2
    Maybe [this can help](http://stackoverflow.com/q/17103925/420683)? – dyp Jun 26 '13 at 11:54
  • 2
    You probably just printed characters the console was incapable of printing. – chris Jun 26 '13 at 11:55
  • 1
    `wchar_t` doesn't really have anything to do with encoding AFAIK (wide character/string literals are using the extended source character set.. but that's about it). – dyp Jun 26 '13 at 11:56
  • You should set the console (programmatically) to be able to output Russian text. Using wchar only is not enough. – SChepurin Jun 26 '13 at 11:57
  • Just to clarify: currently I will be writing for some embedded device, so Windows and Linux are not of concern here, I think... PS: I don't think I will have access to C++11 features for now – pseudonym_127 Jun 26 '13 at 11:58
  • The problem is that the encoding of narrow string literals ("hello") and wide string literals (L"hello") is up to the compiler. Don't expect many guarantees from the Standard for that. The output capabilities of the standard IO device aren't specified either. Also, locales that support Unicode output aren't required to exist. -> you'll have to rely on platform-dependent code (or libraries) – dyp Jun 26 '13 at 12:00
  • Even embedded device needs some OS - either Windows CE or Linux clone. – SChepurin Jun 26 '13 at 12:05
  • @SChepurin What's your definition of an embedded device? (Why does it include the restriction that it has to have an OS?) – dyp Jun 26 '13 at 12:14
  • @DyP - Because it needs one to operate. If not mentioned otherwise. And, please, do not discuss "unimportant" comments, but answer the question asked by OP. – SChepurin Jun 26 '13 at 12:23
  • @Dyp: Not really, that link is for C++ 11 – pseudonym_127 Jun 26 '13 at 12:31
  • @pseudonym_127 Well what it says is basically that even in C++11 you don't really have "Unicode support" (whatever that is meant to be). As you'll **have to** use compiler- and/or OS-/platform-specific features/code, please tell us more about what you want to do (e.g. OS if any) – dyp Jun 26 '13 at 12:32
  • @Dyp: I don't have that information currently. When the time comes I will inquire about that, and possibly also ask here if I have issues. I was just making some general inquiries on the topic of Unicode and its use in C++, hoping to get some insights ... – pseudonym_127 Jun 26 '13 at 12:46
  • @Dyp: What's the use of wchar_t and wstrings if Unicode "is not well" supported anyway in C++? – pseudonym_127 Jun 26 '13 at 13:02
  • @pseudonym_127 `wchar_t` is a Windows thing. (and kind of a Java thing). Unicode is really supported in C++11. – Massa Jun 26 '13 at 13:17
  • @Massa `wchar_t` is a C and C++ thing. It hasn't much to do with Unicode at all (was probably introduced before Unicode was developed / widespread); in C++11 there's `char16_t` and `char32_t` (as well as UTF-8 etc. string literal prefixes) but that's about it with the "support". As MSVC sets the `wchar_t` size to 16 bits, Windows headers use it to deal with UTF-16 strings. – dyp Jun 26 '13 at 13:19
  • @DyP my comment went with some parts missing, sorry. Trying again: Unicode is only mildly and basically supported, via string literals, in C++11. See my answer, below. `wchar_t` is (again, mild and basic) support for an encoding of unicode (UCS-2) that is not recommended anymore (no good Asian language support) and is in c++03 just because Windows and Java APIs loved it. IOW: if you want something in some 8-bit codepage, just use it in the the source code. If you want utf8, use c++11 and `u8` prefix. – Massa Jun 26 '13 at 13:23

2 Answers


This works in C++11 on a Linux, UTF-8 machine:

#include <iostream>

int main(int, char**) {
  std::cout << u8"Humberto Massa Guimarães\nлогин\n";
}
Massa
  • To not further pollute the comment section in the OP: `wchar_t` is only specified to be *at least* 16 bit wide. On some Linux systems with g++ (on many?) it is 32 bit wide. Nothing of this has anything to do with encoding. `char16_t` supersedes `wchar_t` as a container for UTF-16 because it is specified to be exactly 16 bit wide. – dyp Jun 26 '13 at 13:25
  • Also see [wikipedia's entry on wide characters](http://en.wikipedia.org/wiki/Wide_character) -> `wchar_t` has been introduced in C90 (-> not for windows most probably) – dyp Jun 26 '13 at 13:28
  • IIRC, @DyP, Windows and OS/2 were the first widespread occidental OSs that got locales with 16-bit characters (at the time, without something that resembled the Internet, we only had heard about JIS encoding as a 16-bit encoding). Everyone else was using 8-bit encodings, mostly based on the IBM 8-bit code pages. And yes, that was 1986 or so. I recall having a conversation with a college buddy of mine in 1988 about that and he was terrified "but all strings waste twice the memory! preposterous!". At the time we had 1.4MB floppies and 10MB HDDs. – Massa Jun 26 '13 at 14:57

First, you can not print non-english characters in command-line

Second, briefly; UNICODE uses two bytes for every character and char uses single byte. For example string "ABC" will be stored in char as ABC\0 (3 bytes + end_of_string_character)

but in UNICODE it will be stored as A\0B\0C\0\0\0 (6 bytes + an end-of-string character, which is two bytes like the other characters)

To view some text, I suggest the MessageBoxW function:

First, include the Windows header file: #include <windows.h>

Second, use the MessageBoxW API function:

MessageBoxW(0, L"UNICODE text body", L"title", MB_ICONINFORMATION);
Oyle Iste
  • First: No?! (even in Windows you can print Unicode characters on the command line) Second: No?! (even in UTF-16, there are 2-code-unit characters, IIRC the Supplementary Multilingual Plane) – dyp Jun 26 '13 at 12:28
  • Oh and I also fell for it :( -- there are no *Unicode characters*, only e.g. UTF-8-*encoded*-characters. What is meant of course is non-ASCII characters. – dyp Jun 26 '13 at 12:40
  • 1
    @DyP there are actually *sixteen* planes with code points that require 2 UTF-16 code units. About 94% of all Unicode code points require four bytes in UTF-16. About 55% of all currently assigned code points, as of Unicode 6.2, require four bytes in UTF-16. (Keep in mind, however, that this does not mean they show up 55% of the time) – R. Martinho Fernandes Jun 26 '13 at 12:46
  • I'm aware I'm 7 years late, but this needs to be done. "First, you can not print non-english characters in command-line" - yes you can. Some terminals might not support UTF8, but even Windows has a switch to enable the UTF8 codepage. "Second, briefly; UNICODE uses two bytes for every character and char uses single byte. For example string "ABC" will be stored in char as ABC\0 (3 bytes + end_of_string_character)" - no, UTF-16 requires 2 bytes per character. UTF-8 only needs one. UTF-8, 16, and 32 can all store the same characters, but they do so differently. – Zoe Jul 29 '20 at 10:38
  • Even if there's missing font or unicode support, the terminal will still print it. It doesn't care - it got told to print a character, and will happily do so, even if the output is a pile of garbage and can't be read because the characters have been replaced with boxes. With fonts and the right encoding, however, all the major terminals, regardless of which OS, will be able to print UTF8 – Zoe Jul 29 '20 at 10:40