
I am trying to read a plain text file line by line and process each line. I know C++ has wide-character facilities (wfstream, wstring) for reading wchar_t data, so I tried the following:

#include <fstream>
#include <iostream>

int main() {
    std::wfstream file("file");       // aaaàaaa
    std::wstring str;
    std::getline(file, str);
    std::wcout << str << std::endl;   // aaa
}

But as you can see, it did not read the full line. It stops when it reaches "à", which is non-ASCII. How can I fix this?

Mr.C64
Physicist137
  • Related? http://stackoverflow.com/q/10504044/3747990 – Niall Aug 13 '14 at 14:23
  • "wchar" ≠ "non-ASCII", and you may be mistaking any old *8-bit* encoding for "ASCII". However, it's not the problem. This is probably a duplicate of http://stackoverflow.com/questions/4775437/read-unicode-file-into-wstring – Jongware Aug 13 '14 at 14:24
  • @Jongware, I'm looking for a Linux solution. This seems to work only on Windows. – Physicist137 Aug 13 '14 at 14:39
  • @Niall, Not related, nor a duplicate. Those solutions seem to use codecvt, which doesn't work on gcc. – Physicist137 Aug 13 '14 at 14:40
  • @Physicist137, noted (I see the new tag). Wasn't marked as duplicate. In general though, the `wfstream` implementation is backed by a buffer that only reads `char` types, and any unicode requirements are then either read via `binary` access or some conversion for the locale. – Niall Aug 13 '14 at 14:47
  • @Mr.C64, I'm sorry. I hope your answer wasn't big. Still, thanks for your disposition to help me. – Physicist137 Aug 13 '14 at 15:14

1 Answer


You will need to understand some basic concepts of encodings. I recommend reading this article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Basically, you can't assume that every byte is a letter or that every letter fits in a char. The system also needs to know how to extract letters from the sequence of bytes in the file.

Let's assume your file is encoded in UTF-8, which is likely given that you are on Linux. I'll assume your terminal also supports it. If you read directly into a std::string, with chars, everything appears to work. Look:

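// olá
#include <iostream>
#include <fstream>
#include <string>

int main() {
    std::fstream file("test.cpp");   // this program reads its own source file
    std::string str;
    std::getline(file, str);         // str holds the raw UTF-8 bytes of line 1
    std::cout << str << std::endl;   // the bytes go to the terminal unchanged
}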

The output is what you expect, but this is not really correct. Look at what is going on: the file is encoded in UTF-8, which means the first line is this byte sequence:

/  /     o   l       á
47 47 32 111 108 195 161
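To check a byte sequence like this yourself, you can dump the numeric value of each char in the string. This is only a small sketch; byte_values is a hypothetical helper name, not something from the standard library.

```cpp
#include <string>

// Hypothetical helper: lists the decimal value of every byte in the string,
// separated by single spaces.
std::string byte_values(const std::string& s) {
    std::string out;
    for (unsigned char b : s) {            // unsigned, so bytes >= 128 stay positive
        if (!out.empty()) out += ' ';
        out += std::to_string(static_cast<int>(b));
    }
    return out;
}
```

Calling it on the first line of the file, written with explicit byte escapes as "// ol\xC3\xA1", gives back the sequence listed above.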

Note that á is encoded with two bytes. If you ask for the size of the string (str.size()), you get 7, which is the number of bytes and not the number of characters (6). This happens because std::string treats every byte as a char. When you send it to std::cout, the bytes are handed to the terminal unchanged. And the magical part: the terminal works with UTF-8 by default, so it just assumes the string is UTF-8 and correctly prints 6 characters.
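The difference between bytes and characters can be made concrete with a small sketch. It assumes the input is valid UTF-8 and counts code points by skipping continuation bytes (those of the form 10xxxxxx); count_code_points is a hypothetical helper name.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by counting every byte that
// is NOT a continuation byte. Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;   // 10xxxxxx marks a continuation byte
    }
    return n;
}
```

For the line "// ol\xC3\xA1", size() reports 7 while count_code_points reports 6. Note that code points are still not always what a user perceives as one character (combining marks, for instance), so this is only an approximation of "letters".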

You see that it works, but it is not really right. Try any byte-level string operation on the data (truncating it, reversing it, indexing into it) and you may break the UTF-8 encoding, after which it will no longer print correctly!

Let's go for wstrings. They store each letter in a wchar_t which, on Linux, is 4 bytes wide. This is enough to hold any possible Unicode character. But it will not work directly, because C++ uses the "C" locale by default. A locale is a specification of how to deal with various aspects of the system, like "how to print a date" or "how to format a currency value" or even "how to decode text". The last one is what matters here, and the default "C" locale says: "Assume everything is ASCII". When the stream is reading the file and tries to decode a non-ASCII byte, it just fails silently, which is why your getline stopped at "à".
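You can confirm which locale a program starts with by asking for the name of the global locale. A minimal sketch, where global_locale_name is a hypothetical helper:

```cpp
#include <locale>
#include <string>

// Hypothetical helper: a default-constructed std::locale is a copy of the
// current global locale, so this returns the global locale's name.
std::string global_locale_name() {
    return std::locale().name();
}
```

In a program that has not called std::locale::global, this returns "C". Streams copy the global locale when they are constructed, so a std::wfstream starts out with the "C" locale unless you imbue another one.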

The fix is simple: use a UTF-8 locale. Look:

// olá
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

int main() {
    std::ios::sync_with_stdio(false);

    // Throws std::runtime_error if this locale is not installed.
    std::locale loc("en_US.UTF-8"); // You can also use "" for the default system locale
    std::wcout.imbue(loc); // Use it for output

    std::wfstream file("test.cpp");
    file.imbue(loc); // Use it for file input
    std::wstring str;
    std::getline(file, str); // str.size() will be 6
    std::wcout << str << std::endl;
}
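One caveat: the std::locale constructor throws std::runtime_error when the named locale is not installed on the machine. The sketch below shows one possible way to fall back gracefully. Both helper names and the list of candidate locale names are assumptions for illustration; which names exist depends on your system (check with locale -a).

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: returns the named locale, or the classic "C" locale
// if the system does not provide it.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Hypothetical helper: tries a few candidate names in order and returns the
// first one that is available and is not plain "C". The last candidate, "",
// is the user's environment locale, which may or may not be UTF-8.
std::locale pick_utf8_locale() {
    const char* candidates[] = {"en_US.UTF-8", "C.UTF-8", ""};
    for (const char* name : candidates) {
        std::locale loc = locale_or_classic(name);
        if (loc != std::locale::classic()) return loc;
    }
    return std::locale::classic();
}
```

You would then write std::locale loc = pick_utf8_locale(); in place of the hard-coded name. If it ends up returning the classic locale, non-ASCII input will still fail to decode, so you may prefer to report an error in that case.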

You may be asking what std::ios::sync_with_stdio(false); means. It is required because, by default, C++ streams are kept in sync with C streams. This is normally good, because it lets you use both cout and printf in the same program. We have to disable it here because, while synchronized, the output goes through the C streams, which do not use the locale imbued on std::wcout; they will break the UTF-8 encoding and produce garbage on the output.

Guilherme Bernal