Handling Non-ASCII Chars in C++

Question

I am having trouble with non-ASCII characters in C++. I have a file containing non-ASCII characters that I read in C++ using file streams. After reading the file (say 1.txt), I store the data in a string stream and write it to another file (say 2.txt).

Assume 1.txt contains:

ação

In 2.txt I should get the same output, but the non-ASCII characters are printed as their hex values.

Also, I am quite sure that C++ is handling ASCII characters as ASCII only.

Please help me print these characters correctly in 2.txt.

EDIT:

First, pseudocode for the whole process:

1. A shell script reads one value from the DB and stores it in 11.txt
2. C++ code (a.cpp) reads 11.txt and writes to f.txt

Data present in the DB that is being read: Instalação

File 11.txt contains: InstalaÃ§Ã£o

File F.txt Contains: InstalaÃ§Ã£o

Output of a.cpp on screen: Instalação

a.cpp

#include <iterator>
#include <iostream>
#include <algorithm>
#include <sstream>
#include<fstream>
#include <iomanip>

using namespace std;
int main()
{
    ifstream myReadFile;
    ofstream f2;
    myReadFile.open("11.txt");
    f2.open("f2.txt");
    string output;
    if (myReadFile.is_open()) 
    {
        while (!myReadFile.eof())
        {
            myReadFile >> output;
                //cout<<output;

            cout<<"\n";

            std::stringstream tempDummyLineItem;
            tempDummyLineItem <<output;
            cout<<tempDummyLineItem.str();
            f2<<tempDummyLineItem.str();
        }
    }
    myReadFile.close();
    return 0;
}

The locale command reports this:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So exactly what is your question? "How do I identify ASCII characters, and print non-ASCII as hex?" — Mats Petersson, Jul 15 '13 at 07:37
Post your actual code (the smallest sample that exhibits your problem) and then we can tell you what minimal changes have to be made. — chris, Jul 15 '13 at 07:38
I want to get non-ascii chars printed as non-ascii only in 2.txt and not as their hex values — Mayank Jain, Jul 15 '13 at 07:38
@chris - Sorry but I can't post actual C++ code due to copyright issues. — Mayank Jain, Jul 15 '13 at 07:39
@MayankJain, The posted code should be about the same length as that pseudocode. There's no way that an [SSCCE](http://sscce.org) of this could be copyrighted. — chris, Jul 15 '13 at 07:40
@MayankJain Don't forget to indicate how all of the variables are declared (e.g. `int`, `char`, etc.). And you might also indicate how the "non-Ascii" characters are encoded (Latin 1, UTF-8, etc.). — James Kanze, Jul 15 '13 at 07:45
The problem is how to identify the input text coding. It could be UTF-8, but it could also be UTF-16, and in that case the way to interpret chars completely changes. Have a look at this: http://www.codinghorror.com/blog/2005/01/there-aint-no-such-thing-as-plain-text.html — Baltasarq, Jul 15 '13 at 08:38
We still have no idea about the encoding of your files? What is it? utf8? system locale? — Twifty, Jul 15 '13 at 08:40
@MayankJain I don't think you understand the meaning of file encoding. Characters can be written to a file in many ways, the most common being utf8 followed by utf16 big/little endian. Then you have all the system locale multibyte encodings. To answer your question, we need to know the encoding of the file. Try opening it in a text editor and look for something that may tell you how it is encoded. If you wrote it yourself, but don't know the encoding, tell us how you wrote the file. — Twifty, Jul 15 '13 at 08:50
Sorry, I don't understand these concepts thoroughly. However, file -i on Unix gives me this info: text/plain; charset=iso-8859-1. The file is being created by C++ file handling functions. Hope this is what you are looking for. — Mayank Jain, Jul 15 '13 at 09:04
@MayankJain ok, getting somewhere. Look here http://en.wikipedia.org/wiki/ISO/IEC_8859-1 — Twifty, Jul 15 '13 at 09:07
@MayankJain Heres your answer http://stackoverflow.com/questions/11608790/c-ifstream-and-umlauts — Twifty, Jul 15 '13 at 09:11

score 3 · Answer 1 · answered Jul 15 '13 at 08:04

At least if I understand what you're after, I'd do something like this:

#include <iterator>
#include <iostream>
#include <algorithm>
#include <sstream>
#include <iomanip>
#include <string>

std::string to_hex(char ch) {
    std::ostringstream b;
    b << "\\x" << std::setfill('0') << std::setw(2) << std::setprecision(2)
        << std::hex << static_cast<unsigned int>(ch & 0xff);
    return b.str();
}

int main(){
    // for test purposes, we'll use a stringstream for input
    std::stringstream infile("normal stuff. weird stuff:\x01\xee:back to normal");

    infile << std::noskipws;

    // copy input to output, converting non-ASCII to hex:
    std::transform(std::istream_iterator<char>(infile),
        std::istream_iterator<char>(),
        std::ostream_iterator<std::string>(std::cout),
        [](char ch) {
            return (ch >= ' ') && (ch < 127) ?
                std::string(1, ch) :
                to_hex(ch);
    });
}

Twifty · Answer 2 · 2013-07-15T09:39:51.777

Sounds to me like a UTF-8 issue. Since you didn't tag your question with c++11, here is an excellent article on Unicode and C++ streams.

From your updated code, let me explain what is happening. You create a file stream to read your file. Internally, the file stream only deals in chars until you tell it otherwise. A char, on most machines, holds only 8 bits of data, but the non-ASCII characters in your file appear to use more than 8 bits each. To read your file correctly, you NEED to know how it is encoded. The most common encoding is UTF-8, which uses between 1 and 4 chars for each character.

Once you know your encoding, you can either use wifstream (for UTF-16) or imbue() a locale for other encodings.

Update: If your file is ISO-8859-1 (from your comment above), try this.

wifstream myReadFile;
myReadFile.imbue(std::locale("en_US.iso88591"));
myReadFile.open("11.txt");
