Characters not recognized while reading from file

Question

I have the following C++ code in Visual Studio to read characters from a file.

    ifstream infile;
    infile.open(argv[1]);

    if (infile.fail()) {
        cout << "Error reading from file: " << strerror(errno) << endl;
        cout << argv[0] << endl;
    }
    else {
        char currentChar;

        while (infile.get(currentChar)) {
            cout << currentChar << " " << int(currentChar) << endl;
            //... do something with currentChar
        }

        ofstream outfile("output.txt");
        outfile << /* output some text based on currentChar */;
    }
    infile.close();

The file in this case is expected to contain mostly normal ASCII characters, with the exception of two: “ and ”.

The problem is that the code in its current form does not recognise those characters. Printing the character with cout outputs garbage, and its int conversion yields a negative number that differs depending on where in the file it occurs.

I have a hunch that the problem is encoding, so I've tried to imbue infile based on some examples on the internet, but I haven't seemed to get it right. infile.get either fails when reaching the quote character, or the problem remains. What details am I missing?

Try `(int)(unsigned char)currentChar`. When `char` is converted to `int`, sign extension happens if `char` is signed by default (which seems to be the case for your compiler), and that is undesired here. The intermediate conversion to `unsigned char` prevents this. — Scheff's Cat, Feb 26 '18 at 09:31
Alternatively, you could use `int currentChar;` and [`std::istream::get()`](http://en.cppreference.com/w/cpp/io/basic_istream/get), which returns valid characters in the range [0, 255]. In this case a read failure is signalled by a negative return value (EOF). — Scheff's Cat, Feb 26 '18 at 09:37
What is the encoding of the file? Exactly what bytes are in the file? — Martin Bonner supports Monica, Feb 26 '18 at 09:45
@MartinBonner according to the file command, of the input files I have, one is `Non-ISO extended-ASCII text, with CRLF line terminators`, and the other `UTF-8 Unicode text, with no line terminators`. The characters supported are all ascii characters, and `“` and `”` — Gerome Schutte, Feb 26 '18 at 09:48
In addition to @MartinBonner - Typographic quotes are not part of [ISO 8859-1](https://de.wikipedia.org/wiki/ISO_8859-1) nor [ISO 8859-15](https://de.wikipedia.org/wiki/ISO_8859-15). (I suspected from your name that these could be used.) Hence, you probably use [UTF-8](https://de.wikipedia.org/wiki/UTF-8). In UTF-8, typographic quotes are stored with three bytes. — Scheff's Cat, Feb 26 '18 at 09:50
[UTF-8 encoding](http://www.utf8-zeichentabelle.de/unicode-utf8-table.pl): LEFT DOUBLE QUOTATION MARK: U+201C as UTF-8: `"\xe2\x80\x9c"`. The others have values close to this. — Scheff's Cat, Feb 26 '18 at 09:55
"The characters supported are all ascii characters, and `“` and `”`". ASCII (from 0 ... 127) is a proper subset of UTF-8. (UTF-8 was defined this way by design.) So, your file is probably encoded in UTF-8. — Scheff's Cat, Feb 26 '18 at 09:59
@Scheff. Typographic quotes are not part of ISO 8859-1 or ISO 8859-15, but they *are* part of Windows-1252. We *really* need to see the actual octets of the data. OP: If you have `file`, I presume you have POSIX tools available. What does `od -t x1z` say? (Single-byte hex, with character display.) I'm really interested in one of the typographic quotes. — Martin Bonner supports Monica, Feb 26 '18 at 10:43
@MartinBonner "Typographic quotes are not part of ISO 8859-1 or ISO 8859-15, but they are part of Windows 1252." Damn. Not carefully enough researched... In (my) perfect world, there would be only UTF and Unicode. All the other encodings are old and annoying. — Scheff's Cat, Feb 26 '18 at 10:46
@MartinBonner for the file content `“cos341asdas”`, `od -t x1z` generates: `0000000 e2 80 9c 63 6f 73 33 34 31 61 73 64 61 73 e2 80 >...cos341asdas..< 0000020 9d >.< 0000021` — Gerome Schutte, Feb 26 '18 at 11:43
The `file` program guesses the character encoding. Many encodings could fit the same bytes, but it always reports just one guess. You shouldn't have to guess. You should know. If users are providing the text files, you just have to tell them to use a certain encoding or require them to tell you the encoding they are using. BTW, ASCII is used every day in certain contexts but it is not at all "normal". — Tom Blodget, Feb 26 '18 at 18:16

score 2 · Accepted Answer · answered Feb 26 '18 at 12:14

The file you are trying to read is likely UTF-8 encoded. Most characters read fine because UTF-8 is backwards compatible with ASCII.

In order to read a UTF-8 file I'll refer you to this: http://en.cppreference.com/w/cpp/locale/codecvt_utf8

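    #include <fstream>
    #include <iostream>
    #include <string>
    #include <sstream>   // std::wstringstream
    #include <locale>
    #include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
    ...

    // Write file in UTF-8; generate_header writes a byte order mark first.
    // Note: std::locale::empty() is a Microsoft extension (the question uses Visual Studio).
    std::wofstream wof;
    wof.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::generate_header>));
    wof.open(L"file.txt");
    wof << L"This is a test.";
    wof << L"This is another test.";
    wof << L"\nThis is the final test.\n";
    wof.close();

    // Read file in UTF-8; consume_header skips a leading byte order mark if present.
    std::wifstream wif(L"file.txt");
    wif.imbue(std::locale(std::locale::empty(), new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));

    // Pull the whole decoded file into a wide string stream.
    std::wstringstream wss;
    wss << wif.rdbuf();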

(from here)
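
Since `std::codecvt_utf8` is deprecated as of C++17 and `std::locale::empty()` is specific to Microsoft's library, a portable alternative is to read the raw bytes with an ordinary `ifstream` and decode them by hand. The following is only a sketch under stated assumptions: `decode_utf8` is a hypothetical name, malformed input is mapped to U+FFFD as an arbitrary choice, and overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into Unicode code points.
// Malformed or truncated sequences become U+FFFD (a choice made for this sketch).
std::u32string decode_utf8(const std::string& bytes) {
    const char32_t kReplacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += kReplacement; ++i; continue; }  // stray continuation or invalid lead byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }           // input ended early
            unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }          // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out += ok ? cp : kReplacement;
    }
    return out;
}
```

With this approach the file can stay a plain byte stream: read it into a `std::string`, call `decode_utf8`, and compare elements against `0x201C` and `0x201D` to detect the two typographic quotes.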

Perfect! Thanks. It's worth noting that wcout doesn't follow this fix. [source](http://www.cplusplus.com/forum/beginner/126557/#msg685276) — Gerome Schutte, Feb 26 '18 at 12:52
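
One possible way to sidestep wide output streams, sketched here under assumptions rather than taken from the thread, is to convert the processed text back to UTF-8 bytes and write those through a narrow stream such as the `ofstream` in the question. The names `append_utf8` and `encode_utf8` are hypothetical, and whether a console displays the resulting bytes correctly depends on its configuration, which this sketch does not address.

```cpp
#include <string>

// Appends the UTF-8 encoding of one code point to `out`.
// Surrogates and values above U+10FFFF are replaced with U+FFFD
// (a choice made for this sketch).
void append_utf8(std::string& out, char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encodes a whole sequence of code points as UTF-8 bytes.
std::string encode_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) append_utf8(out, cp);
    return out;
}
```

The returned `std::string` can be written with `outfile << encode_utf8(text);`, which keeps `output.txt` in the same encoding as the input file.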

AhmadWabbi · Answer 2 · 2018-02-26T09:39:21.193

-2

Try:

    while (infile.get(&currentChar, 1))

Also, be sure that you pass argv[1]. Print its value:

    cout << argv[1] << endl;

edited Feb 26 '18 at 09:39

answered Feb 26 '18 at 09:34

This too just changes which overload of `get` is called. It doesn't change the OP's problem which is how to read non-ASCII. – Martin Bonner supports Monica Feb 26 '18 at 09:44
