I did some Googling around, but couldn't find a clear answer (not using the correct terminology perhaps?)

Anyway, I have some text files in ANSI format (WCP-1252) whose characters I want to process in a C++ program, but the thing is I don't know how to store the 2-byte characters that correspond to decimal codes 128 through to 255. Just to be sure though, I tried the following code:

ifstream infile("textfile.txt");
char c;
infile>>c;                           //also tried infile.get(c);  
cout<<c;

Unsurprisingly, the 1-byte char failed to store any symbol from the extended set after 0x7F (I think it just displayed the ASCII symbol corresponding to the value of the first byte and discarded the second, or vice versa).

Thomas Dickey
Ali250
  • *Reading* the chars is not actually the problem, but you are converting them to something else (and "2 bytes" suggests Unicode). You are correct: you cannot store a Unicode character in a simple char. Use `wchar_t` instead. However, `cout` failing on high ASCII chars is another (unrelated) issue. – Jongware Dec 07 '13 at 11:24
  • I thought infile>>c; would catch non-printables, but printing them out is a different matter. Try `cout< – RichardPlunkett Dec 07 '13 at 11:24
  • Are you sure you have two byte characters in the file? Unless I'm confused, WCP-1252 only has characters from 0-255 or single byte. – Retired Ninja Dec 07 '13 at 11:24
  • Hold on with the question editing for a moment. "the value of the first byte and discarded the second" -- impossible. Win-1252 specifies an ASCII codepage, all values *are* only 1 char wide. – Jongware Dec 07 '13 at 11:24
  • .. Paste a small fragment of this mystery text into your post. I'm betting it's UTF8. – Jongware Dec 07 '13 at 11:26
  • Please provide a complete minimal example. Windows codepage 1252 (Windows ANSI Western) is an extension of ISO Latin-1. All code points are 1 byte, so the information given about >1 byte must refer to something you haven't shown. – Cheers and hth. - Alf Dec 07 '13 at 11:48
  • Okay, I'm really confused here. The reason I was calling it a "2 byte character" is that I thought char is a signed byte, and since ASCII runs from 0 to 127 and takes 1 byte, by extension storing a WCP-1252 character within the range 128-255 would take two bytes. Though I guess I'm wrong here? For now I'm testing with a simple txt file with just the character **é** in it (decimal 233), saved in ANSI format. When I print it using the code in the question, I get **8** as output from cout. – Ali250 Dec 07 '13 at 12:39
  • A single char can store 256 'values'; anything above 0x7F gets simply *interpreted* as a negative number. Look up "signed/unsigned characters". Anyhow: your test *ought* to have worked. Perhaps your console doesn't know how to display Windows-1252. What OS are you on? – Jongware Dec 07 '13 at 13:25
  • I'm using Windows 7 and Visual Studio 2012. After reading the answer below I wrote a brief test code: http://pastie.org/8535498 As before, my text file just has that one French character in it. But still I'm just getting an output **8**. What gives? – Ali250 Dec 07 '13 at 13:36
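The signed/unsigned point raised in the comments can be checked by printing the numeric value of each byte instead of the character itself. The following is only a minimal sketch: the helper names `byteValue` and `readByteValues` are made up for illustration and do not come from the question or the linked paste.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: the numeric value (0..255) of a byte held in a char.
// Where char is signed, the byte 0xE9 reads back as -23; converting through
// unsigned char recovers 233.
int byteValue(char c)
{
    return static_cast<unsigned char>(c);
}

// Hypothetical helper: read a whole file as raw byte values.
// std::ios::binary avoids any text-mode translation of the bytes.
std::vector<int> readByteValues(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<int> values;
    char c;
    while (in.get(c))
        values.push_back(byteValue(c));
    return values;
}
```

If a file holding a single é saved as Windows-1252 yields the value 233 (0xE9), the read itself is fine and one byte is enough; what remains is how the console renders that byte.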

1 Answer

WCP-1252 is an 8-bit encoding, so every character takes one byte, but the values above 0x7F are not part of ASCII. I suggest you write a conversion table from WCP-1252 to wchar_t, then read char by char and convert each one to wchar_t. You can write a map< uint8_t, wchar_t >. For example:

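#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

// Map one WCP-1252 byte to the corresponding wide character.
wchar_t WCP1252Towc( char ch )
{
    // Key on unsigned char so that bytes 0x80-0xFF do not become negative keys.
    static const map< unsigned char, wchar_t > table
    {
        {0x30, L'0'},
        {0x31, L'1'},
        // ...
        {0x39, L'9'},

        {0x41, L'A'},
        // ...
        {0x5A, L'Z'},

        {0x61, L'a'},
        // ...
        {0x7A, L'z'},

        // ...
    };

    // find() does not insert an entry for bytes that are missing from the table.
    const auto it = table.find( static_cast< unsigned char >( ch ) );
    return it != table.end() ? it->second : L'?';   // '?' for unmapped bytes
}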

wstring WCP1252sTowcs( string str )
{
    const auto len = str.size();
    wstring res( len, L'\0' );

    for( size_t i = 0; i < len; ++i )
        res[ i ] = WCP1252Towc( str[ i ] );

    return res;
}

int main()
{
    ifstream infile( "textfile.txt" );
    string line;
    getline( infile, line );              // read the first line as raw WCP-1252 bytes
    const auto unicode = WCP1252sTowcs( line );
    wcout << unicode;
}
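The table does not have to list all 256 entries. In Windows-1252 only the bytes 0x80 through 0x9F differ from ISO 8859-1; every other byte maps to the Unicode code point with the same value. Below is a sketch of that shortcut, not the code from this answer: the names `cp1252ToCodePoint` and `cp1252ToUtf32` are hypothetical, `char32_t` is used instead of `wchar_t` to sidestep the implementation-defined width of `wchar_t`, and mapping the five undefined bytes to U+FFFD is an assumption made here.

```cpp
#include <string>

// Hypothetical helper: decode one Windows-1252 byte to a Unicode code point.
char32_t cp1252ToCodePoint(unsigned char byte)
{
    // Code points for bytes 0x80..0x9F. U+FFFD (replacement character) stands
    // in for the five bytes that Windows-1252 leaves undefined (an assumption
    // of this sketch).
    static const char32_t high[32] = {
        0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
        0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
    };
    if (byte >= 0x80 && byte <= 0x9F)
        return high[byte - 0x80];
    return byte;  // 0x00..0x7F and 0xA0..0xFF keep their numeric value
}

// Hypothetical helper: decode a whole Windows-1252 byte string.
std::u32string cp1252ToUtf32(const std::string& bytes)
{
    std::u32string out;
    out.reserve(bytes.size());
    for (char c : bytes)
        out.push_back(cp1252ToCodePoint(static_cast<unsigned char>(c)));
    return out;
}
```

With this, the byte 0xE9 (é) decodes to U+00E9 and 0x80 (€) decodes to U+20AC. Displaying the result is a separate step that depends on the console.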
RedX
Elvis Dukaj
  • Thanks. I'm making a complete map right now, but what's with the "L" in the map entries before the character? – Ali250 Dec 07 '13 at 12:49
  • Because some WCP-1252 characters are not representable with the char type. You need to use wchar_t, so the `L` before the literal tells the compiler that it is a wchar_t and not a char. – Elvis Dukaj Dec 07 '13 at 12:51
  • `wchar_t` is implementation-specific and _not_ guaranteed to be able to store unicode characters – Erbureth Dec 07 '13 at 13:04
  • Okay, I wrote a brief code based on yours: http://pastie.org/8535498 I'm using it to read a text file that just has the symbol **é** stored in it (hence only one entry in the map), but just like my first code in the question, it simply prints an **8** in the console. Am I doing something wrong or is my Visual Studio not using WCP-1252 or something? (I assumed it does that by default). – Ali250 Dec 07 '13 at 13:41
  • I discovered something interesting: it's very hard for a console application to print non-ASCII chars. Look at this: http://stackoverflow.com/questions/15473051/reading-writing-printing-utf-8-in-c11?rq=1 – Elvis Dukaj Dec 08 '13 at 12:06
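One possible way around the console problem mentioned in the last comment is to re-encode the decoded text as UTF-8 and write it as plain bytes. The following sketch shows the encoding step only; the names `appendUtf8` and `toUtf8` are hypothetical, and whether the result displays correctly still depends on the terminal (on Windows, for instance, the console code page would have to be UTF-8, e.g. via `chcp 65001`).

```cpp
#include <string>

// Hypothetical helper: append the UTF-8 encoding of one code point
// (assumed to be a valid value up to U+10FFFF) to out.
void appendUtf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: encode a sequence of code points as a UTF-8 string.
std::string toUtf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text)
        appendUtf8(out, cp);
    return out;
}
```

Every character that Windows-1252 can represent lies below U+10000, so the first three branches are the ones that matter here; the fourth is included only for completeness.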