
As an exercise, I am making a simple vocabulary trainer. The file I am reading contains the vocabulary, which also includes special characters such as äöü.

However, I have been struggling to read this file without getting mangled characters instead of the appropriate special characters.

I understand why this is happening but not how to correctly solve it.

Here is my attempt:

Unit(const char* file)
    : unitName(getFileName(file), false)
{
    std::wifstream infile(file);
    std::wstring line;
    infile.imbue(std::locale(infile.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>()));
    while (std::getline(infile, line))
    {
        std::wcout << line << L"\n";
        this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
    }
}

The reading process stops as soon as an entry is reached that contains a special character.

I have even been able to change the code slightly to see where exactly it stops reading:

while (infile.eof() == false)
{
    std::getline(infile, line);
    std::wcout << line << L"\n";
    this->vocabulary.insert(parseLine(line.c_str(), Language::EN_UK, Language::DE));
}

If I do it like this, the output repeats the entry containing the special character, but cuts it off right before the special character would appear, like so:

Instead of:
cross-class|klassenübergreifend 
It says:
cross-class|klassen
cross-class|klassen
cross-class|klassen
cross-class|klassen
.
.
.

This leads me to believe that the special character gets misinterpreted as a line ending by getline.

I do not care whether I use getline or something else, but for my parse function to work, the string it receives needs to represent one line of the file. Therefore reading the entire buffer into a single string won't work, unless I do the separation myself.

How can I properly and neatly read a UTF-8 file line by line?


Note: I looked for other articles on here, but most of them either also use getline or explain why this happens without showing how to solve it.

MoustacheSpy
  • This is our pragmatic approach: UTF-8 was designed so that it can be processed by programs which understand the 128 ASCII characters and simply pass the rest through. Hence, in our software we always use UTF-8 internally, stored in a common `std::string` (or referred to with `const char*`). As long as we only do storage - no issues. For proper exchange with the GUI (in our case Qt), we explicitly provide it as UTF-8 and request UTF-8. (Qt might convert it internally, which is beyond our scope.) – Scheff's Cat Jun 10 '18 at 13:05
  • Once, when we did portability wrappers for selected functions of the win32 API, we used our own functions to convert UTF-8 from/to UTF-16 (which was stored in `std::wstring`). Please note that converting UTF-8 from/to UTF-16 is a mere computation, without the necessity to "know" any contents (as both are distinct encodings of Unicode). The win32 API provides each respective function with a `W` suffix, which is the wide-character - effectively the UTF-16 - variant. Btw., `äöüß` are our common test cases (or issues) as well... ;-) "klassenübergreifend"... :-) – Scheff's Cat Jun 10 '18 at 13:13
  • Applying this to the above comments: _How can I properly and neatly read a utf-8 file line by line?_: With `std::ifstream` and without consideration of any specific locale. (If in doubt - just use the C locale.) – Scheff's Cat Jun 10 '18 at 13:20
  • 1
    Why are you using `wchar_t` with *UTF-8* data? – Nicol Bolas Jun 10 '18 at 13:37
  • @Scheff I am sorry, I think I made a mistake with the title? Anyway, I am simply trying to print the text with the special characters in the console, that's all. I now realise that I have been doing UTF-16, thanks to your comments, but I still don't know how I can get the special characters to display correctly when printing them to the command line (I am not using Qt). – MoustacheSpy Jun 10 '18 at 13:45
  • Console output in Windows is a drama. As I told you, we use UTF-8 exclusively in our software. Output for users is localized using translation tables (with English fall-back messages). Hence, sometimes these texts are printed on the console. (As we have only a few console applications but mostly GUIs, these are rarely intended for end users - rather debug messages for developers and testers.) Correct console output of UTF-8 in Windows - no chance. I once tried `SetConsoleOutputCP(65001); // UTF-8` which I found "accidentally" in SO - it's a pity... – Scheff's Cat Jun 10 '18 at 15:57
  • ...still no `äöüß` (and typographical quotation marks, and Greek symbols, and...). The only change was _how_ it looked wrong. After a recent update of Windows, the output suddenly stopped entirely at the first non-ASCII sequence. Finally, I became tired of this and dropped the topic, i.e. I gave up. (It's not that bad, as (our) users don't like the console.) ;-) You may google the topic and will get multiple hits on SO, as this question pops up periodically. Ah yepp, I forgot to mention that I don't expect issues in other OSes. AFAIK, Linux terminals (at least xterm) do support UTF-8 fine. – Scheff's Cat Jun 10 '18 at 16:04
  • Just found another promising Q/A: [SO: Output unicode strings in Windows console app](https://stackoverflow.com/q/2492077/7478597). (Did I give up too early?) For UTF-8 to UTF-16 conversion, there is e.g. this: [SO: How to convert UTF-8 std::string to UTF-16 std::wstring?](https://stackoverflow.com/a/7154226/7478597). (In our S/W, we do it quite similar.) If you get a running example I would enjoy to read your self-answer. – Scheff's Cat Jun 10 '18 at 16:17

0 Answers