
I am trying to use the std::locale mechanism in C++11 to count words in different languages. Specifically, I have std::wstringstream which contains the title of a famous Russian novel ("Crime and Punishment" in English). What I want to do is to use the appropriate locale (ru_RU.utf8 on my Linux machine) to read the stringstream, count the words and print the results. I should also probably note that my system is set to use the en_US.utf8 locale.

The desired result is this:

0: "Преступление"
1: "и"
2: "наказание"

I counted 3 words.
and the last word was "наказание"

That all works when I set the global locale, but not when I attempt to imbue the wcout stream. When I try that, I get this result instead:

0: "????????????"
1: "?"
2: "?????????"

I counted 3 words.
and the last word was "?????????"

Also, when I attempt to use a solution suggested in the comments (which can be activated by changing #define USE_CODECVT 0 to #define USE_CODECVT 1), I get the error mentioned in this other question.

Those interested in experimenting with the code, or with compiler settings or both may wish to use this live code.

My questions

  1. Why does that not work? Is it because wcout is already open?
  2. Is there a way to use imbue rather than setting the global locale to do what I want?

If it makes a difference, I'm using g++ 4.8.3. The full code is shown below.

getwords.cpp

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <locale>

#define USE_CODECVT 0
#define USE_IMBUE   1

#if USE_CODECVT
#include <codecvt>
#endif 
using namespace std;

int main()
{
#if USE_CODECVT
    locale ru("ru_RU.utf8", 
        new codecvt_utf8<wchar_t, 0x10ffff, consume_header>{});
#else
    locale ru("ru_RU.utf8");
#endif
#if USE_IMBUE
    wcout.imbue(ru);
#else
    locale::global(ru);
#endif
    wstringstream in{L"Преступление и наказание"};
    in.imbue(ru);
    wstring word;
    unsigned wordcount = 0;
    while (in >> word) {
        wcout << wordcount << ": \"" << word << "\"\n";
        ++wordcount;
    }
    wcout << "\nI counted " << wordcount << " words.\n"
        << "and the last word was \"" << word << "\"\n";
}
Edward
  • Try installing a utf8 converting facet into the locale: `locale ru{"ru_RU.utf8", new std::codecvt_utf8<wchar_t>{}};`. This requires the `<codecvt>` header. – David G Oct 15 '14 at 17:00
  • Unfortunately, that doesn't compile here. See [this question](http://stackoverflow.com/questions/15615136/is-codecvt-not-a-std-header) for my exact symptoms. I don't know of a workaround using g++. – Edward Oct 15 '14 at 18:51
  • UTF-8 is not locale-dependent. It can represent any Unicode codepoint, used by any language. I don't think the problem lies in the conversion performed by `wcout`. I'd check two things. First, whether the string literal makes its way into the binary intact. Do `wcout << (int)L'П';` - this should print `1055`; if it doesn't, the character is mangled by the compiler. Second, whether the console is set up to display non-English characters. Redirect output to a file, examine it with a hex viewer. Cyrillic `'П'` should be represented as two bytes `D0 9F` – Igor Tandetnik Oct 15 '14 at 23:43
  • Redirecting the output to a file makes no difference, and the characters are correctly represented in the string. I added a new last line to the program `wcout << "The first letter of the last word is U+0" << hex << (int)(word[0]) << " (" << word[0] << ")\n";` which prints `The first letter of the last word is U+043d (?)` – Edward Oct 16 '14 at 10:56
  • @0x499602D2 I'd prefer a non-Boost answer, but any answer would be appreciated. – Edward Oct 16 '14 at 14:54
  • `Redirecting the output to a file makes no difference` What byte(s) appear in the file where `П` should be? – Igor Tandetnik Oct 16 '14 at 14:59
  • @IgorTandetnik: a single byte `3f`, which corresponds to the `?` character. – Edward Oct 16 '14 at 15:16
  • Just to be sure, what OS are you using, and if under Windows what version ? – Serge Ballesta Oct 21 '14 at 14:29
  • @SergeBallesta: I'm using Linux 3.16.4, Fedora 20 distribution. – Edward Oct 21 '14 at 14:40
  • To confirm: you did `wcout << (int)L'П';` and it printed `1055`? Your print statement checks the state of the `wstring` after it round-trips through a `wstringstream` you have `imbue`d, which is a different test. ("the characters are correctly represented in the string" might imply this, but it does not say what you did to determine they were correctly represented) – Yakk - Adam Nevraumont Oct 21 '14 at 15:45
  • @Yakk: Yes, it printed 1055. You can also demonstrate this for yourself by using this live code as posted in the question: http://coliru.stacked-crooked.com/a/6d7bc409f511b0ae – Edward Oct 21 '14 at 16:57

3 Answers

First, I ran some more tests using your code, and I can confirm that L"Преступление и наказание" is a correctly encoded wide-character Unicode string. I checked the codes of the individual characters, and they are correct: 0x41f, 0x440, 0x435, 0x441, 0x442, 0x443, 0x43f, 0x43b, 0x435, 0x43d, 0x438, 0x435, 0x20, 0x438, 0x20, 0x43d, 0x430, 0x43a, 0x430, 0x437, 0x430, 0x43d, 0x438, 0x435

I could not find any reference for it, but it looks like simply calling imbue is not enough. imbue is a method of basic_ios, which is an ancestor of cout and wcout. It does affect numeric conversions, but in all my tests it had no effect on the charset used for output.

By default, the locale used in a C++ (or C) program is the C locale, which knows nothing about Unicode. All printable ASCII characters (below 128) are output as is, and all others are replaced with a ?. That is exactly what your program does.

To make it work correctly, you have to select a locale that knows about Unicode characters with setlocale. Once this is done, you can change the numeric conversion by calling imbue, and since you selected a Unicode charset, all will be fine.

So provided your current locale uses a UTF-8 charset, you only have to add

setlocale(LC_ALL, "");

as the first line in your program, and the output will be as expected:

0: "Преступление"
1: "и"
2: "наказание"

I counted 3 words.
and the last word was "наказание"

If your current locale does not use UTF-8, choose one that is installed on your system and that supports it. I used setlocale(LC_ALL, "fr_FR.UTF-8"); and even setlocale(LC_ALL, "en_US.UTF-8");, and both worked.

Edit :

In fact, the best way to correctly output Unicode to the screen is to use setlocale(LC_ALL, "");. It automatically adapts to the current charset. I tested with a stripped-down variant using the Latin1 charset (my system natively uses French, not Russian):

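#include <clocale>   // setlocale
#include <iostream>
#include <locale>

using namespace std;

int main() {
    // Adopt the locale (and therefore the charset) of the environment.
    setlocale(LC_ALL, "");
    // U+00E8 (è) and U+00E9 (é), followed by the terminating null.
    wchar_t ws[] = { 0xe8, 0xe9, 0 };

    wcout << ws << endl;
}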

I tried it under Linux using the UTF-8 and ISO-8859-1 (latin1) charsets (export LANG=fr_FR.UTF-8 and export LANG=fr_FR.ISO-8859-1, respectively) and I correctly got èé in the proper charset. I also tried it under Windows XP, with codepages 850 (oem) and 1252 (ansi) (chcp 850 and chcp 1252, respectively, with the Lucida Console font), and got èé on the console too.

Edit 2 :

Of course, you can also set a global C++ locale with locale::global(locale("")); for the default locale or locale::global(locale("ru_RU.UTF-8")); for the Russian locale, but that does more than simply calling setlocale. According to the documentation of the GNU implementation of the C++ Standard Library about locale: "there is only one relation (of the C++ locale mechanism) to the C locale mechanism: the global C locale is modified if a named C++ locale object is set as the global locale", that is: std::locale::global(std::locale("")); affects the C functions as if the following call was made: std::setlocale(LC_ALL, "");. On the other hand, there is no vice versa, that is, calling setlocale has no [effect] whatsoever on the C++ locale mechanism, in particular on the working of locale("").

So it really looks like there is an underlying C library mechanism that must first be enabled with setlocale for the imbue conversion to work correctly.

Serge Ballesta
  • I appreciate your effort. Unfortunately, it seems to me that `setlocale(LC_ALL, "");` is not substantially different from the `locale::global(ru);` that I had in the original code. – Edward Oct 22 '14 at 00:42
  • Life Saver! `setlocale(LC_ALL, "");` works! – yu yang Jian Apr 14 '21 at 14:44

In this answer, I'm taking the questions in reverse order, and adding another (with answer) that came up along the way.

Is there a way to use imbue rather than setting the global locale to do what I want?

Yes. By default, std::wcout is synchronized to the underlying stdout C stream. So std::wcout can use imbue if that synchronization is turned off, allowing the C++ stream to operate independently. So to modify the original code to use imbue and work as intended only a single line need be added, calling std::ios_base::sync_with_stdio:

std::ios_base::sync_with_stdio(false);
std::wcout.imbue(ru);

Why didn't the original version work?

The standard (I'm referring to INCITS/ISO/IEC 14882-2011[2012]) says very little about the tie to the underlying stdio stream, but in 27.4.3 it says

The object wcout controls output to a stream buffer associated with the object stdout, declared in <cstdio>

Further, without explicitly setting a global locale, the locale is the "C" locale which is US English ASCII, so this appears to imply that stdout will, by default, have an ASCII mapping. Since no Cyrillic characters are represented in ASCII, the underlying stdout is what converts the proper Russian into a series of ? characters.

Why must the sync_with_stdio call precede imbue?

According to 27.5.3.4 of the standard:

If any input or output operation has occurred using the standard streams prior to the call, the effect is implementation-defined. Otherwise, called with a false argument, it allows the standard streams to operate independently of the standard C streams.

Edward

I don't know what languages you're planning on supporting, but there are languages where your algorithm doesn't apply, e.g. Japanese. I suggest checking out the word iterators in International Components for Unicode. http://userguide.icu-project.org/boundaryanalysis

Brent