
This is a question about Unicode characters in a text input file. This discussion was close but not quite the answer. Compiled with VS2008 and executed on Windows, these characters are recognized on read (possibly represented as a different symbol, but read); compiled with g++ and executed on Linux, they are displayed as blank.

‚ ƒ „ … † ‡ ˆ ‰ Š ‹ Œ Ž ‘ ’ “ ” • – — ˜ ™ š › œ ž Ÿ

The rest of the Unicode symbols appear to work fine. I did not check them all, but I found that this set did not work.

Questions: (1) Why? (2) Is there a solution?

void Lexicon::buildMapFromFile(string filename )  //map
{
    ifstream file;
    file.open(filename.c_str(), ifstream::binary);
    string wow, mem, key;
    unsigned int x = 0;

    while(true) {
        getline(file, wow);
        cout << wow << endl;
        if (file.fail()) break; //boilerplate check for error
        while (x < wow.length() ) {
            if (wow[x] == ',') { //look for csv delimiter
                key = mem;
                mem.clear();
                x++; //step over ','
            } else 
                mem += wow[x++];
        }

        //cout << mem << " code " << key << " is " << (key[0] - '€') << " from €" << endl;

        cout << "enter 1 to continue: ";
        while (true) {
            int choice = GetInteger();
            if (choice == 1) break;
        }

        list_map0[key] = mem; //char to string
        list_map1[mem] = key; //string to char
        mem.clear(); //reset memory
        x = 0;//reset index
    }
    //printf("%d\n", list_map0.size());
    file.close();
}

The Unicode symbols are read from a CSV file and parsed into the Unicode symbol and an associated string. Initially I thought there was a bug in the code, but in this post the review found the code is fine, and I traced the issue to how the characters are handled.

The test is cout << wow << endl;


3 Answers


The characters you show are all characters from Windows codepage 1252 which do not exist in the ISO-8859-1 encoding. These two encodings are similar and so are often confused.

If the input is CP1252 and you are reading it as though it were ISO-8859-1, then those characters are read as control characters and will not behave as normal, visible characters.
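A quick way to confirm this (my own minimal sketch, not part of the original answer) is to dump the raw bytes of each line. If the file is CP1252, every symbol listed in the question shows up as a single byte in the range 0x80-0x9F, which is exactly the C1 control block of ISO-8859-1:

#include <cstdio>
#include <fstream>
#include <string>

// Dump each byte of each line as hex. CP1252 input will show the symbols
// in question as single bytes 0x80-0x9F -- the C1 control range in
// ISO-8859-1, which a typical Linux terminal renders as nothing at all.
int main(int argc, char* argv[])
{
    if (argc < 2) return 1;
    std::ifstream file(argv[1], std::ifstream::binary);
    std::string line;
    while (std::getline(file, line)) {
        for (std::string::size_type i = 0; i < line.size(); ++i)
            std::printf("%02X ", static_cast<unsigned char>(line[i]));
        std::printf("\n");
    }
    return 0;
}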

There are many possible things you could be doing incorrectly to cause this; you'll have to post more details to determine which. A more complete answer requires knowing how you are reading the data, how you are converting and storing it internally, how you are testing the read data, and the input data and/or its encoding.

Your displayed code doesn't do any conversions while reading the data, and the commented-out code to print the data likewise does no conversions. This means that, to print the data, you are simply relying on the input data being correct for the platform you run the program on. That means, for example, that if you run your program in the console on Windows then your input file needs to be encoded using the console's codepage*.

To fix the problem you can either ensure the input file matches the encoding required by the particular console you run the program on, or specify the input encoding, convert to a known internal encoding when reading, and then convert to the console encoding when printing (a sketch of the second option follows the footnote below).

* and if it's not, for example if the console is cp437 and the file is cp1252 then the characters you listed will instead show up as: É æ Æ ô ö ò û ù ÿ Ö Ü ¢ £ ¥ ₧ ƒ á í ó ú ñ Ñ ª º ¿ ⌐ ¬ ½ ¼ ¡ « »
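For the second option above, here is a minimal hand-rolled sketch (my illustration; the answer itself gives no specific code, and the name cp1252ToUtf8 is mine) that converts CP1252 input to UTF-8, which is what a typical modern Linux console expects:

#include <string>

// Hand-rolled CP1252 -> UTF-8 conversion (illustrative helper). Bytes
// below 0x80 are already valid UTF-8; bytes 0x80-0x9F use the
// CP1252-specific table; bytes 0xA0-0xFF coincide with ISO-8859-1,
// whose code points equal the byte values.
std::string cp1252ToUtf8(const std::string& in)
{
    // Unicode code points for CP1252 bytes 0x80-0x9F (0 = unassigned).
    static const unsigned short table[32] = {
        0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
        0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
    };
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        unsigned int cp = c;
        if (c >= 0x80 && c <= 0x9F) {
            cp = table[c - 0x80];
            if (cp == 0) cp = '?'; // byte has no CP1252 meaning
        }
        if (cp < 0x80) {                  // one UTF-8 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {          // two UTF-8 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                          // three UTF-8 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

The lookup table is the published CP1252-to-Unicode mapping for the 0x80-0x9F block. With this in place, printing cp1252ToUtf8(wow) instead of wow in the question's loop would display the symbols correctly in a UTF-8 terminal.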

bames53
  • @bames53 added the details to the question – forest.peterson Jan 14 '13 at 19:21
  • @bames53 I used the unicode chars to represent string tokens that ontologically represent actions; see ICD9 & masterformat. I use this internally: on read I convert from the ontology string to the unicode char, and on write I convert back from the unicode char to the ontology string. A single char simplified a lot. The software developer here is amused by my grad-student solution but told me to use a vector of ints rather than unicode chars to achieve the same function with better behavior. But I want to understand why it did not work, since in concept it works and actually worked on Windows. – forest.peterson Jan 14 '13 at 20:03
  • @bames53 thank you; after reviewing the tables on the ISO-8859-1 and Windows-1252 wiki pages, with your explanation it is clear now. Since you know this topic: I also had issues on Windows with À Á Â Ã showing up as Á (the same issue for subsequent letter groups) – I have not checked for the same issue on Linux. – forest.peterson Jan 14 '13 at 20:49

Your problem statement does not detail the platform for g++, but from your tags it appears that you are compiling the same code on Linux.

Windows and Linux are both Unicode enabled. So, if your code on Windows/VS2008 used the wstring class, you have to change it back to string on Linux/g++; if you are using wstring on Linux, it will not work the same way.
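One concrete difference behind this (my note, not part of the original answer): wchar_t is 2 bytes under VS2008 on Windows but 4 bytes under g++ on Linux, so wide strings hold UTF-16 units on one platform and UTF-32 units on the other. A one-line check:

#include <iostream>

int main()
{
    // Prints 2 when built with VS2008 on Windows, 4 with g++ on Linux.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << std::endl;
    return 0;
}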


Unicode handling in C++ code is not straightforward and is implementation-dependent (you have already seen that the output changes between VS2008 and g++). Furthermore, Unicode can be represented by different character encodings (such as UTF-8 and UTF-16).

There is an enlightening article on this page. It talks about Unicode conversion for STL-based software. For text I/O the main weapon is codecvt, a standard library facet that can be used to translate strings between different character encodings.
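As a minimal sketch of that approach (mine, not from the linked article, and assuming a C++11 compiler: <codecvt> did not exist yet in VS2008 and was later deprecated in C++17):

#include <codecvt>
#include <locale>
#include <string>

// Illustrative helpers (the names are mine): convert between UTF-8 byte
// strings and wide strings using the codecvt machinery mentioned above.
std::wstring utf8ToWide(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.from_bytes(bytes);
}

std::string wideToUtf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t> > conv;
    return conv.to_bytes(wide);
}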

Davide Aversa
  • @Davide Aversa I read the link you gave, but 1) the implementation I am using is much simpler, and 2) my knowledge is below the level of that article – forest.peterson Jan 14 '13 at 19:49