Reading in Russian characters (Unicode) using a basic_ifstream

Question

Is this even possible? I've been trying to read a simple file that contains Russian, and it's clearly not working.

I've called file.imbue(loc), and at that point loc is correct (Russian_Russia.1251). buf is of type basic_string<wchar_t>.

I'm using basic_ifstream<wchar_t> because this is a template (technically basic_ifstream<T>, but in this case T = wchar_t).

This all works perfectly with English characters...

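// get() reads every character, including whitespace, so a space ends a word
// (operator>> would skip whitespace unless noskipws is set on the stream).
while (file.get(ch))
{
    if(isalnum(ch, loc))
    {
        buf += ch;
    }
    else if(!buf.empty())
    {
        // Do stuff with buf.
        buf.clear();
    }
}
// A word that runs up to end-of-file is still in buf at this point.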

I don't see why I'm getting garbage when reading Russian characters. For example, if the file contains хеы хеы хеы, I get "яюE", 5(square), K(square), etc.

Oh the lovely problematic streams in C++ :) Maybe this can give you a hint: http://stackoverflow.com/questions/1509277/why-does-wide-file-stream-in-c-narrow-written-data-by-default – Khaled Alshaya, Mar 17 '10 at 17:01
So there really isn't a way that will allow use of templated streams? This seems far too complicated the way I'm looking at it. Is there no way at all to have a stream read a particular kind of character? – Mark, Mar 17 '10 at 17:12
Firstly, "хеы хеы хеы" is definitely not Russian (although it has Russian characters in it). Secondly, could you make your example "complete" and provide a link to a sample file? In that case I'll be glad to try to help. – mlvljr, Mar 21 '10 at 11:37

score 1 · Answer 1 · answered Mar 17 '10 at 17:09

Code page 1251 isn't Unicode: it is Windows-1251, a single-byte Cyrillic encoding (similar in purpose to ISO 8859-5, but not the same encoding). Unfortunately, chances are that your iostream implementation doesn't support UTF-16 "out of the box." This is a bit strange, since doing so would just involve passing the data through unchanged, but most implementations still don't support it. For what it's worth, at least if I recall correctly, C++0x is supposed to add this.
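
A hedged aside on the symptom: the output quoted in the question ("яюE", 5(square), K(square)) is what the bytes FF FE 45 04 35 04 4B 04 look like when decoded as Windows-1251, and those bytes are "хеы" in UTF-16 little-endian with a leading byte-order mark. That is consistent with the file being UTF-16LE rather than 1251, though the question does not confirm it. A minimal sketch of a byte-order-mark check is below; detect_bom is a hypothetical helper name, not a standard function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard function): names the encoding implied
// by a leading byte-order mark, or returns "unknown" if none is recognised.
// Note: a UTF-32LE mark (FF FE 00 00) also starts with FF FE and is
// reported as "UTF-16LE" by this simplified check.
std::string detect_bom(const std::string& bytes)
{
    auto byte_at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 &&
        byte_at(0) == 0xEF && byte_at(1) == 0xBB && byte_at(2) == 0xBF)
        return "UTF-8";
    if (bytes.size() >= 2 && byte_at(0) == 0xFF && byte_at(1) == 0xFE)
        return "UTF-16LE";
    if (bytes.size() >= 2 && byte_at(0) == 0xFE && byte_at(1) == 0xFF)
        return "UTF-16BE";
    return "unknown";
}
```

Reading the first few bytes of the file in binary mode and passing them to a check like this would tell you which decoding the stream needs before any locale is imbued.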

Jerry Coffin

So, std::basic_ifstream just cannot be done? Then why does it exist? Forgive the nature of my questions, I just don't see a way, at all, to read multibyte characters using streams, and have them be anything but garbage as soon as they're read, unless you write code specifically for each kind of multibyte encoding - which defeats the point of templates altogether. – Mark Mar 17 '10 at 17:19
@Mark: The important point here is that your input isn't Unicode. Is your implementation expecting Unicode? – David Thornley Mar 17 '10 at 17:45
I'm not really sure what you mean - all I know is that the file will be in either ASCII or Unicode (and it's supposed to be selectable at compile time whether or not to use wide or narrow characters - using a template). – Mark Mar 17 '10 at 18:00
basic_[io]stream can be done, but most implementations assume the external encoding will be something like ISO 8859-x or Shift JIS rather than Unicode. Though they didn't really plan it that way, it's possible to make them read/write files in UTF-8 encoded Unicode. Getting it to work with UTF-16 or UTF-32/UCS-4 would be more difficult. Given that you're doing different transformations with each, at some point you need unique code for each encoding. The template reduces unnecessary duplication elsewhere. – Jerry Coffin Mar 17 '10 at 18:17

score 1 · Answer 2 · answered Mar 17 '10 at 17:42

There are still lots of STL implementations that don't have a std::codecvt that can handle Unicode encodings. Their wchar_t templated streams will default to the system code page, even though they are otherwise Unicode enabled for, say, the filename. If the file actually contains UTF-8, they'll produce junk. Maybe this will help.
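
As a rough illustration of supplying such a facet yourself: since C++11 the header <codecvt> provides std::codecvt_utf8 (deprecated since C++17, and not available when this answer was written). The sketch below assumes the file really is UTF-8; read_utf8_file is a hypothetical helper name.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads a whole UTF-8 file into a wide string.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf8_file(const std::string& path)
{
    // consume_header skips a leading UTF-8 byte-order mark if one is present.
    typedef std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header> Utf8Facet;

    std::wifstream file;
    // Imbue before opening; the locale takes ownership of the facet.
    file.imbue(std::locale(std::locale::classic(), new Utf8Facet));
    file.open(path.c_str(), std::ios::in | std::ios::binary);

    std::wstring text;
    wchar_t ch;
    while (file.get(ch))   // get() keeps whitespace, unlike operator>>
        text += ch;
    return text;
}
```

The returned string can then be scanned with isalnum(ch, loc) as in the question, since the characters are already decoded by the time they reach the loop.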

score 0 · Answer 3 · answered Mar 17 '10 at 17:41

Iostreams, by default, assume any data on disk is in a non-Unicode format, for compatibility with existing programs that do not handle Unicode. C++0x will fix this by allowing native Unicode support, but at this time there is a std::codecvt<wchar_t, char, mbstate_t> used by iostreams to convert the normal char data into wide characters for you. See cplusplus.com's description of std::codecvt.

If you want to use Unicode with iostreams, you need to specify a codecvt facet with the form std::codecvt<wchar_t, wchar_t, mbstate_t>, which just passes through data unchanged.
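
For reference, file streams look up a facet of the form std::codecvt<wchar_t, char, mbstate_t> in the imbued locale, so a sketch of the imbue approach can use a facet of that shape. The one below is std::codecvt_utf16 from C++11's <codecvt> (deprecated since C++17), which is a swapped-in technique and not the pass-through facet described above. It assumes the file is UTF-16 little-endian with an optional byte-order mark; read_utf16le_file is a hypothetical name.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: reads a UTF-16LE file (optional BOM) into a wide string.
// Returns an empty string if the file cannot be opened.
std::wstring read_utf16le_file(const std::string& path)
{
    // Little-endian input; consume_header skips a leading byte-order mark.
    typedef std::codecvt_utf16<
        wchar_t, 0x10ffff,
        static_cast<std::codecvt_mode>(std::little_endian | std::consume_header)>
        Utf16Facet;

    std::wifstream file;
    // Imbue before opening; the locale takes ownership of the facet.
    file.imbue(std::locale(std::locale::classic(), new Utf16Facet));
    // Binary mode keeps the runtime from translating bytes before decoding.
    file.open(path.c_str(), std::ios::in | std::ios::binary);

    std::wstring text;
    wchar_t ch;
    while (file.get(ch))   // get() keeps whitespace, unlike operator>>
        text += ch;
    return text;
}
```

The decoding facet and the Russian_Russia.1251 locale used for isalnum can be separate objects: the first only needs to be imbued on the stream, while the second is passed to the classification call.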

Billy ONeal

You just pass the facet to basic_istream::use_facet, like you would with any other facet. – Billy ONeal Mar 17 '10 at 19:16
I'm not sure that exists... Maybe I'm misunderstanding how facets work, but I don't see how you could pass one to use_facet, since I don't think use_facet is defined for basic_ifstream. I could be wrong... – Mark Mar 18 '10 at 05:33
Sorry -- I'm not very familiar with this stuff :( I think the method you're looking for is `std::basic_ifstream::imbue`. – Billy ONeal Mar 18 '10 at 12:35

score 0 · Answer 4 · answered Mar 17 '10 at 18:42

I am not sure, but you can try to call setlocale(LC_CTYPE, "");

VitalyVal

Err.. no, that's the default locale in any case. – Billy ONeal Mar 17 '10 at 19:18
