
How can I print a string like this: €áa¢cée£ on the console/screen? I tried this:

#include <iostream>    
#include <string>
using namespace std;

wstring wStr = L"€áa¢cée£";

int main (void)
{
    wcout << wStr << " : " << wStr.length() << endl;
    return 0;
}

which is not working. Even more confusing: if I remove € from the string, the output comes out like this: ?a?c?e? : 7, but with € in the string, nothing gets printed after the € character.

If I write the same code in Python:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

wStr = u"€áa¢cée£"
print u"%s" % wStr

it prints out the string correctly on the very same console. What am I missing in C++ (well, I'm just a noob)? Cheers!!


Update 1: based on n.m.'s suggestion

#include <iostream>
#include <string>
using namespace std;

string wStr = "€áa¢cée£";
char *pStr = 0;

int main (void)
{
    cout << wStr << " : " << wStr.length() << endl;

    pStr = &wStr[0];
    for (unsigned int i = 0; i < wStr.length(); i++) {
        cout << "char "<< i+1 << " # " << *pStr << " => " << pStr << endl;
        pStr++;
    }
    return 0;
}

First of all, it reports 14 as the length of the string: €áa¢cée£ : 14. Is that because it's counting 2 bytes per character?

And all I get this:

char 1 # ? => €áa¢cée£
char 2 # ? => ??áa¢cée£
char 3 # ? => ?áa¢cée£
char 4 # ? => áa¢cée£
char 5 # ? => ?a¢cée£
char 6 # a => a¢cée£
char 7 # ? => ¢cée£
char 8 # ? => ?cée£
char 9 # c => cée£
char 10 # ? => ée£
char 11 # ? => ?e£
char 12 # e => e£
char 13 # ? => £
char 14 # ? => ?

as the last cout output. So the actual problem still remains, I believe. Cheers!!


Update 2: based on n.m.'s second suggestion

#include <iostream>
#include <string>

using namespace std;

wchar_t wStr[] = L"€áa¢cée£";
int iStr = sizeof(wStr) / sizeof(wStr[0]);        // length of the string
wchar_t *pStr = 0;

int main (void)
{
    setlocale (LC_ALL,"");
    wcout << wStr << " : " << iStr << endl;

    pStr = &wStr[0];
    for (int i = 0; i < iStr; i++) {
       wcout << *pStr << " => " <<  static_cast<void*>(pStr) << " => " << pStr << endl;
       pStr++;
    }
    return 0;
}

And this is what I get as my result:

€áa¢cée£ : 9
€ => 0x1000010e8 => €áa¢cée£
á => 0x1000010ec => áa¢cée£
a => 0x1000010f0 => a¢cée£
¢ => 0x1000010f4 => ¢cée£
c => 0x1000010f8 => cée£
é => 0x1000010fc => ée£
e => 0x100001100 => e£
£ => 0x100001104 => £
 => 0x100001108 => 

Why is it reported as 9 rather than 8? Or is this what I should expect? Cheers!!

MacUsers
  • What is the encoding of your source code file? ASCII? – selalerer Jul 23 '11 at 10:33
  • Possible duplicate : http://stackoverflow.com/questions/331690/c-source-in-unicode – BenjaminB Jul 23 '11 at 10:35
  • @selalerer: "encoding of the source file" - like `# -*- coding: utf-8 -*-` in Python? How do I know or set that in C++? I just use vim to write the script. Cheers!! – MacUsers Jul 23 '11 at 10:42
  • 1
    @Mac Every source file is just a text file. Every text file has some encoding; it can be something based on the ASCII table (in which every character is one byte), or UTF-8, or UTF-16, etc. Today every text editor supports saving the file in whichever encoding you choose. How to do this in vim? http://stackoverflow.com/questions/778069/how-can-i-change-a-files-encoding-with-vim – selalerer Jul 23 '11 at 16:13
  • @selalerer: this is what it is: `uniTest.cpp: UTF-8 Unicode c program text`. I'd be surprised if it wasn't. vim is being used in the very same way for everything, and while e.g. Python works, C++ doesn't. Anything else you think is still missing? Cheers! – MacUsers Jul 23 '11 at 16:27

1 Answer


Drop the L before the string literal. Use std::string, not std::wstring.

UPD: There's a better (correct) solution: keep wchar_t, wstring and the L, and call setlocale(LC_ALL,"") at the beginning of your program.

You should call setlocale(LC_ALL,"") at the beginning of your program anyway. This instructs your program to use your environment's locale instead of the default "C" locale. Your environment has a UTF-8 locale, so everything should work.

Without calling setlocale(LC_ALL,""), the program works with UTF-8 sequences without "realizing" that they are UTF-8. If a correct UTF-8 sequence is printed on the terminal, it will be interpreted as UTF-8 and everything will look fine. That's what happens if you use string and char: gcc uses UTF-8 as a default encoding for strings, and the ostream happily prints them without applying any conversion. It thinks it has a sequence of ASCII characters.

But when you use wchar_t, everything breaks: gcc uses UTF-32, the correct re-encoding is not applied (because the locale is "C") and the output is garbage.

When you call setlocale(LC_ALL,"") the program knows it should recode UTF-32 to UTF-8, and everything is fine and dandy again.

This all assumes that we only ever want to work with UTF-8. Using arbitrary locales and encodings is beyond the scope of this answer.

n. m. could be an AI
  • WOW!! that really works. That makes me ask another question: what are `wstring` (and hence, I think, `wchar_t`) actually for, then? Cheers!! – MacUsers Jul 23 '11 at 13:16
  • 1
    `wchar_t` is a nebulous type that is "big enough to hold any character from your system's character set", but it's entirely up to your platform what to do with that. Usually you have to interface it with the environment using `mbstowcs`/`wcstombs` functions, or `%Ls` in `printf`, etc. [See here](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) for a little rant of mine on the subject, or [use C++0x](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x) for explicit Unicode strings. – Kerrek SB Jul 23 '11 at 14:21
  • @n.m.: Well, I'm not really sure now if it's actually working: if I try something like this: `cout << *pStr << " => " << pStr << endl;` - it prints this: `? => €áa¢cée£` on the console. `pStr` is a pointer of type char. I've updated my original post with my modified script. Cheers!! – MacUsers Jul 23 '11 at 14:30
  • @Kerrek SB: Thanks for the links. Is `C++0x` supported on any platform/OS? Cheers!! – MacUsers Jul 23 '11 at 14:46
  • C++0x is supported by GCC 4.3 and up, and by MSVS2010. Don't know about other compilers... – Kerrek SB Jul 23 '11 at 14:49
  • I have edited the answer with another suggestion, hopefully more correct than the previous one! I was in a hurry and didn't finish it properly the first time around. – n. m. could be an AI Jul 23 '11 at 16:50
  • @n.m.: I get: `error: expected constructor, destructor, or type conversion before ‘(’ token` on the setlocale(LC_ALL,"") line. I added this line right after including the headers. Am I doing anything wrong? Most importantly, what does that error mean? Cheers!! – MacUsers Jul 23 '11 at 18:19
  • 1. You need to include an additional header, `#include <clocale>`. 2. You need to call `setlocale` inside the `main()` function. You cannot use statements at file scope in C++; only declarations are permitted there. The error is a bit cryptic: the compiler tried to interpret the statement as a declaration, but gave up in the middle. It says what kind of input it expected at that point. – n. m. could be an AI Jul 23 '11 at 18:32
  • aahhh...... I get it now. Although I already included the header, it's actually working without it for me. It almost worked, except one small problem: a [white] space is added to the end and `9` is reported as the length of the string. Why is this happening? I've updated my original post with the new modification. Cheers!! – MacUsers Jul 23 '11 at 18:53
  • Hm, I don't know why you get 9 and a white space. I get 8 and no white space, as expected. Perhaps you have a weird locale. What does the `locale` Linux command say, and also `echo $LANG`? UPD: I know why: you're printing the terminating NULL character. Don't. Use std::wstring for simplicity; don't muck with arrays, pointers and NULL-terminated strings. – n. m. could be an AI Jul 23 '11 at 19:01
  • `en_GB.UTF-8` is being reported (as I expect) for both `$LANG` and all the ENVIRONMENTs of `locale`. – MacUsers Jul 23 '11 at 19:07
  • I think my `int iStr` calculation is wrong. If I change it to `wstring wStr = L"€áa¢cée£";` and `int iStr = wStr.length();`, everything works just fine, as expected. Cheers!! – MacUsers Jul 23 '11 at 19:17