fstream not working properly with Russian text?

Question

I work with Russian text a lot, and I've been trying to read data from a file with an input stream. Here's the code; it's supposed to output only the words that contain no more than 5 characters.

#include <iostream>
#include <fstream>
#include <string>
#include <Windows.h>
using namespace std;
int main()
{
    setlocale(LC_ALL, "ru_ru.utf8");
    ifstream input{ "in_text.txt" };
    if (!input) {
        cerr << "Ошибка при открытии файла" << endl;
        return 1;
    }
    cout << "Вывод содержимого файла: " << "\n\n";
    string line{};
    while (input >> line) {
        if (line.size() <= 5)
            cout << line << endl;
    }
    cout << endl;

    input.close();
    return 0;
}

Here's the problem:

I noticed the output didn't include all of the words that actually contain no more than 5 characters. So I did a simple test with the English word "Test" and its Russian translation "тест", which has the same number of characters. My text file looks like this:

Test тест

I used the debugger to see how the program runs: it printed the English word and skipped the Russian one. I can't understand why this is happening.

P.S. When I changed the condition to if (line.size() <= 8), it printed both of them. Very odd.

I think I may have messed up my system locale somehow. I once tried to use std::locale without really understanding it, and maybe that changed something on my PC; I'm not sure. Please help.

I think this could help you, given that you're working on Windows and using non-ASCII characters: https://learn.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings — zerocukor287, Nov 08 '21 at 21:20
utf8 means that Russian glyphs are encoded by 2 or more bytes, but since string.size() counts bytes, not characters, you get the wrong count. — Michael Veksler, Nov 08 '21 at 21:21
@MichaelVeksler how do you recommend solving this problem? I've previously used setlocale(LC_ALL, "Russian") but it hasn't been working recently for some reason — Victor Ajayi, Nov 08 '21 at 21:26
You may have better success with wide characters, wcout, wstring, L"Ошибка при открытии файла", and so on. I don't have enough experience with non-ASCII text in c++, to recommend on best practice. I'd use the same code as you, only convert the string to wstring and test its size instead. See https://stackoverflow.com/questions/2573834/c-convert-string-or-char-to-wstring-or-wchar-t — Michael Veksler, Nov 08 '21 at 21:38
@user4581301 that is exactly the problem, a byte is a byte but Victor wants to count Russian unicode characters, which are encoded in more than one byte each — Michael Veksler, Nov 08 '21 at 21:41
This will not help for the problem at hand but I think it'd be better in general to do `std::locale::global(std::locale("ru_ru.utf8"));` instead of calling `setlocale`. — Ted Lyngmo, Nov 08 '21 at 21:46
@MichaelVeksler thanks a lot. I'll try it out and comment on the results later. I hope you don't mind if I tag you again — Victor Ajayi, Nov 08 '21 at 21:51
@TedLyngmo thanks. Could you please recommend a way to learn/understand `std::locale`, I've tried reading cppreference it was very difficult to comprehend — Victor Ajayi, Nov 08 '21 at 21:53
@VictorAjayi I haven't worked with locales much myself and also find the standard library messy unfortunately. If the file is really in utf8, it's pretty easy to decode and encode manually though. You could convert what you read into codepoints (and store them in a `std::u32string`). That way, counting letters becomes easy. There is also `codecvt_*` functions and also `std::c32rtomb` and `std::mbrtoc32` that may help. — Ted Lyngmo, Nov 08 '21 at 22:30
use `std::wifstream` and `_setmode(_fileno(stdout), _O_U16TEXT)` . Really — Алексей Неудачин, Nov 08 '21 at 23:04
Victor, did I understand it correctly? Is the file utf8 encoded and do you wish to print the utf8 encoded word in `line` if it's `<= 5` codepoints long? — Ted Lyngmo, Nov 09 '21 at 10:55
@TedLyngmo yes, the file is utf8 encoded. Basically, the problem is in trying to find the correct size of a string with Cyrillic characters. `"Test"` gives `4` with `size()` but `"тест"` gives `8`. — Victor Ajayi, Nov 09 '21 at 16:12
@VictorAjayi Did you try the code in my answer? Did you check the demo? — Ted Lyngmo, Nov 15 '21 at 13:34
@TedLyngmo yes it did! Both of them did, I'm sorry I forgot to say that. Thanks a lot. I'll be using your `std::locale` fix from now on. I may not really understand the utf-32 fix but it does work — Victor Ajayi, Nov 15 '21 at 15:06
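
As a minimal illustration of the point raised in the comments above (that `size()` counts bytes while the question needs characters), the sketch below counts UTF-8 code points by skipping continuation bytes. The name `count_code_points` is hypothetical, and the function assumes well-formed UTF-8 without validating it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 encoded string.
// Every code point starts with exactly one byte that is NOT of the form
// 10xxxxxx, so counting those bytes gives the number of code points.
// Assumes the input is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80)  // not a continuation byte
            ++count;
    }
    return count;
}
```

For the two words in the question, `size()` reports 4 for "Test" and 8 for "тест" (each of these Cyrillic letters takes two bytes in UTF-8), whereas this helper reports 4 for both. Note that a code point is not always what a reader sees as one character: combining marks, for example, are separate code points.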

Ted Lyngmo · Accepted Answer · 2021-11-09T23:35:22.973

I'm very unsure about this but using codecvt_utf8 and wstring_convert seems to work:

#include <codecvt>   // codecvt_utf8
#include <string>
#include <iostream>
#include <locale>    // std::wstring_convert

int main() {
    // ...

    while (input >> line) {
        // convert the utf8 encoded `line` to utf32 encoding:
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> u8_to_u32;
        std::u32string u32s = u8_to_u32.from_bytes(line);

        if (u32s.size() <= 5)           // check the utf32 length
            std::cout << line << '\n';  // but print the utf8 encoded string
    }

    // ...
}

Demo
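
One caveat about the approach above: `std::wstring_convert` and `std::codecvt_utf8` have been deprecated since C++17, so newer compilers may warn about them. A possible alternative, along the lines of the manual decoding suggested in the comments, is sketched below. The names `decode_utf8` and `short_words` are hypothetical, and the decoder assumes well-formed UTF-8: it does not detect malformed, overlong, or truncated sequences.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: decodes a UTF-8 string into UTF-32 code points.
// Assumes well-formed input; invalid sequences are not reported.
std::u32string decode_utf8(const std::string& utf8) {
    std::u32string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        if (lead < 0x80)      { cp = lead;        len = 1; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; }  // 11110xxx
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: reads whitespace-separated words from `input` and
// keeps those with at most `max_len` code points. The words themselves are
// returned unchanged, still UTF-8 encoded.
std::vector<std::string> short_words(std::istream& input, std::size_t max_len) {
    std::vector<std::string> result;
    std::string word;
    while (input >> word) {
        if (decode_utf8(word).size() <= max_len)
            result.push_back(word);
    }
    return result;
}
```

Calling `short_words(input, 5)` on the `ifstream` from the question would keep both "Test" and "тест", since each decodes to four code points. The length check uses the decoded string, but the original UTF-8 bytes are what get stored and printed, mirroring the loop above.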
