
I am having an issue with "umlauts" (letters ä, ü, ö, ...) and ifstream in C++.

I use curl to download an html page and ifstream to read in the downloaded file line by line and parse some data out of it. This goes well until I have a line like one of the following:

te="Olimpija Laibach - Tromsö";
te="Burghausen - Münster";

My code parses these lines and outputs them as follows:

Olimpija Laibach vs. Troms?
Burghausen vs. M?nster

Things like outputting umlauts directly from the code work:

cout << "öäü" << endl; // This works fine

My code looks somewhat like this:

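ifstream fin("file");
string line;

// Loop on getline() itself so a failed read at end of file is not processed.
while (getline(fin, line)) {
    // find() returns string::npos (not a negative int) when nothing matches.
    if (line.find("te=") != string::npos) {
        string::size_type pos = line.find(" - ");
        string team1 = line.substr(4, pos - 4);
        string team2 = line.substr(pos + 3, line.length() - pos - 6);
        cout << team1 << " vs. " << team2 << endl;
    }
}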

Edit: The weird thing is that the same code (the only changed things are the source and the delimiters) works for another text input file (same procedure: download with curl, read with ifstream). Parsing and outputting a line like the following is no problem:

<span id="...">Fernwärme Vienna</span>
mike
  • Once you know what the encoding of the input is, some of the examples at cppreference may help, e.g. [here](http://en.cppreference.com/w/cpp/locale/codecvt#Example) – jogojapan Jul 23 '12 at 08:25
  • possible duplicate of [does (w)ifstream support different encodings](http://stackoverflow.com/questions/1274910/does-wifstream-support-different-encodings) – jogojapan Jul 23 '12 at 08:26
  • I just edited and extended my question. I don't understand why the (nearly) same code is working with another input. – mike Jul 23 '12 at 08:44
  • Usually ``std::cout << "öäü" << std::endl;`` also does not work. – Giriraj Pawar Sep 07 '21 at 11:50

1 Answer


What's the locale embedded in fin? In the code you show, it would be the global locale, which if you haven't reset it, is "C".

If you're anywhere outside the Anglo-Saxon world (and the strings you show suggest that you are), one of the first things you do in main should be

std::locale::global( std::locale( "" ) );

This sets the global locale (and thus the default locale for any streams opened later) to the locale being used in the surrounding environment. (Formally, to an implementation-defined native environment, but in practice, to whatever the user is using.) In the "C" locale, the encoding is almost always ASCII; ASCII doesn't recognize umlauts, and according to the standard, illegal encodings in input should be replaced with an implementation-defined character (IIRC; it's been some time since I've actually reread this section). In output, of course, you're not supposed to have any unknown characters, so the implementation doesn't check for them, and they go through.

Since std::cin, etc. are opened before you have a chance to set the global locale, you'll have to imbue them with std::locale( "" ) specifically.

If this doesn't work, you might have to find some specific locale to use.
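As a rough sketch of the steps above: constructing a `std::locale` from a name throws `std::runtime_error` when the system does not provide that locale, so it can help to probe a name before relying on it. The helper names `locale_available` and `use_locale` are made up for this illustration and are not part of the standard library.

```cpp
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if the named locale can be constructed
// on this system. std::locale's constructor throws std::runtime_error for
// names the implementation does not recognize.
bool locale_available(const std::string& name) {
    try {
        std::locale probe(name.c_str());
        (void)probe;  // only the construction matters here
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Hypothetical helper: install the named locale as the global locale and
// imbue the standard streams, which were constructed before the global
// locale could be changed. Returns false if the locale is unavailable.
bool use_locale(const std::string& name) {
    if (!locale_available(name)) {
        return false;
    }
    std::locale loc(name.c_str());
    std::locale::global(loc);  // default for streams created from now on
    std::cin.imbue(loc);
    std::cout.imbue(loc);
    std::cerr.imbue(loc);
    return true;
}
```

Calling something like `use_locale("")` at the top of `main` would pick up the user's environment; a named locale such as the ones discussed in the comments only works if it is actually installed on the machine.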

James Kanze
  • Figuring out the encoding of HTML is non-trivial. (in the best case, finding a line like ``) Using the user's locale is only a slightly better guess. – MSalters Jul 23 '12 at 08:51
  • Unfortunately this did not help. I included `std::locale::global( std::locale( "de_DE.UTF-8" ) );` as the first line in `main` but the output stays the same. Worth mentioning that I am using an Amazon EC2 instance in the US to compile and run the code. – mike Jul 23 '12 at 08:56
  • @mike: Is `UTF-8` actually the input encoding? (It could be ISO-8859-1 or ISO-8859-15, or something completely different.) Is `de_DE.UTF-8` actually supported on the system you're using? – DevSolar Jul 23 '12 at 09:09
  • found the following line in the html of the page that is not working for me: ``. Changed locale to `std::locale::global( std::locale( "de_DE.iso88591" ) );` but the problem stays the same. No difference with `std::locale::global( std::locale( "de_DE.iso885915@euro" ) );` either. – mike Jul 23 '12 at 09:36
  • @MSalters If you're reading HTML, then the header should contain an indication of the encoding, and you can `imbue` the corresponding locale. – James Kanze Jul 23 '12 at 12:51
  • @mike A quick check would be to dump the string you're reading, in hex, to see what it contains. But if you `imbue` the input with the correct locale, there should be no problem. (For that matter, although arguably incorrect, the implementations of "C" that I know treat input transparently, and there's a lot of code that would break if they stopped doing so.) – James Kanze Jul 23 '12 at 12:53
  • @JamesKanze I tried the following (and other locales): `locale mylocale("de_DE.iso88591");` `fin.imbue(mylocale);` Which unfortunately did not solve my problem. The dump for "Tromsö" from the text file is `54726F6D736`. – mike Jul 23 '12 at 17:01
  • @mike That can't be correct for the dump, because it contains an odd number of hex digits: any dump must contain an even number. Otherwise, if the locale you specify is present, and the encoding of that locale corresponds to the encoding of the input, it should work. – James Kanze Jul 29 '12 at 15:16
  • Just made a simple test using Firefox. I opened the website and played with different encodings via view - encoding. The text is displayed correctly with ISO-8859-1 and ISO-8859-15, but not with UTF-8. So it should work with one of the two encodings. Is my way with `locale mylocale("de_DE.iso88591"); ifstream fin("...."); fin.imbue(mylocale);` maybe wrong? Should the locale be imbued somewhere else? – mike Jul 30 '12 at 09:28
  • The dump for the hardcoded string "Tromsö" is `54726F6D73s3m6`, the dump for the same string from the html page is `54726F6D736`. Getting headaches ... ;) – mike Jul 30 '12 at 09:38
  • @mike If the locale is imbued before the first input, it should work. (There are cases where it will also work after the first input.) And there seems to be some problem with the dump of the hard-coded string: a hex dump cannot contain the characters s or m. – James Kanze Jul 30 '12 at 15:06
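The hex dump suggested in the comments can be sketched as below, printing exactly two hex digits per byte so the result always has an even length. In ISO-8859-1, "ö" is the single byte F6, while in UTF-8 it is the two bytes C3 B6, so the dump shows directly which encoding a string is in. The second function is one possible workaround, under the assumption (not confirmed in the thread) that the input file is ISO-8859-1 and the terminal expects UTF-8. Both helper names are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: render every byte of s as two uppercase hex digits.
std::string hex_dump(const std::string& s) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size() * 2);
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char byte = static_cast<unsigned char>(s[i]);
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}

// Hypothetical helper: re-encode ISO-8859-1 bytes as UTF-8.
// ASSUMPTION: the input really is ISO-8859-1, where each byte value equals
// its Unicode code point; bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(in[i]);
        if (byte < 0x80) {
            out += static_cast<char>(byte);                  // plain ASCII
        } else {
            out += static_cast<char>(0xC0 | (byte >> 6));    // lead byte
            out += static_cast<char>(0x80 | (byte & 0x3F));  // continuation
        }
    }
    return out;
}
```

Comparing `hex_dump(line)` for a line read from the file with the dump of the hard-coded literal should show whether the two differ only in how the umlaut is encoded.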