
I'm trying to read Alt key symbols from one Unicode UTF-8 file, and write to another.

The input file looks like this:

ỊịỌọỤụṄṅ

The output file looks like this:

239 187 191 225 187 138 225 187 139 225 187 140 225 187 141 225 187 164 225 187 165 225 185 132 225 185 133 (in the actual file each number is followed by '\n' instead of ' ')

Code:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <Windows.h>


///convert as ANSI - display as Unicode
std::wstring test1(const char* filenamein)
{
    std::wifstream fs(filenamein);
    if(!fs.good()) 
    { 
        std::cout << "cannot open input file [" << filenamein << "]\n" << std::endl;  

        return std::wstring(); // empty string; constructing a wstring from NULL is undefined behavior
    }

    wchar_t c; 
    std::wstring s;

    while(fs.get(c)) 
    { 
        s.push_back(c); 
        std::cout << '.' << std::flush; 
    }

    return s;

}

int printToFile(const char* filenameout, std::wstring line)
{
    std::wofstream fs;

    fs.open(filenameout);

    if(!fs.is_open())
        return -1;

    for(unsigned i = 0; i < line.length(); i++)
    {
        if(line[i] <= 126)  //if its standard letter just print to file
            fs << (char)line[i];
        else  //otherwise do this.
        {
            std::wstring write = L"";

            std::wostringstream ss;
            ss << (int)line[i];

            write = ss.str();

            fs << write;
            fs << std::endl;
        }
    }

    fs << std::endl;


    //2nd test, also fails
    char const *special_character[] = { "\u2780", "\u2781", "\u2782",
  "\u2783", "\u2784", "\u2785", "\u2786", "\u2787", "\u2788", "\u2789" };

    //prints out four '?'
    fs << special_character[0] << std::endl;
    fs << special_character[1] << std::endl;
    fs << special_character[2] << std::endl;
    fs << special_character[3] << std::endl;

    fs.close();

    return 1;
}

int main(int argc, char* argv[])
{
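    if(argc < 3)  //both file names are required
    {
        std::cout << "usage: " << argv[0] << " <input file> <output file>" << std::endl;
        return 1;
    }

    std::wstring line = test1(argv[1]);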

    if(printToFile(argv[2], line) == 1)
        std::cout << "Writing success!" << std::endl;
    else std::cout << "Writing failed!" << std::endl;



    return 0;
}

What I was expecting was something similar to the values in this table:

http://tools.oratory.com/altcodes.html

Rorschach
  • The referenced page uses the cp850 charset. Unix-like systems (Linux or BSD) often use latin1 or UTF-8; Windows uses cp1252, which is very close to latin1. CP850 is the so-called OEM charset (multilingual for Western Europe) in Windows and is the default charset in console windows (`cmd.exe`) – Serge Ballesta Mar 08 '16 at 13:03
  • I was hoping to get results similar to ASCII table values when reading 'regular' characters. Something like this [link](http://www.ascii.cl/htmlcodes.htm). But I guess I was wrong. – Rorschach Mar 08 '16 at 15:32
  • This table is about full Unicode code points. But Unicode is not by itself an encoding. Concrete encodings are UTF-8 (mainly in the Unix-like world) or UTF-16 (mainly on Windows). You **must** know the actual encoding of a file to decode the characters. If you don't, you will have bytes, but it is hard to know what character `0xe8` is... Almost all common charsets do agree on codes below 127, which is good old ASCII. For example, A is always 0x41. – Serge Ballesta Mar 08 '16 at 16:03
  • I wrote my input file using Notepad++. Notepad++ gives the following information: Length 24 (which makes some sense, because there are 8 letters), UTF-8 (set in the 'Encoding' tab). I thought that meant my encoding was definitely set to UTF-8. – Rorschach Mar 08 '16 at 16:21

1 Answer


Ok, per your code and comments, I understand the following:

  • you have an input file that contains a UTF-8 encoded string
  • you are reading it on Windows into wide characters but without imbuing any locale

So here is what actually happens:

Your code correctly reads the file one byte at a time, as an ANSI file (as if it were win1252 encoded). Your program then displays the code value of each byte. I can confirm that the list of bytes you show in your post is the UTF-8 encoding of the string ỊịỌọỤụṄṅ, except that Notepad++ has added a Byte Order Mark (U+FEFF) at the start, which is not normally used in UTF-8 files. The BOM is the 3 bytes 239 187 191 (in decimal) or 0xEF 0xBB 0xBF (in hexadecimal).
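To see why the output shows three numbers per letter, the bytes can be decoded by hand. The following is only a minimal sketch, not code from the question: `decodeUtf8` is a hypothetical helper name, and it validates lead and continuation bytes but does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into a list of Unicode code points.
// Hypothetical helper for illustration only.
std::vector<std::uint32_t> decodeUtf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes that follow

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + extra >= bytes.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte has the form 10xxxxxx and adds 6 payload bits.
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        assert(cp <= 0x1FFFFF); // at most 3 + 18 payload bits
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, the bytes 225 187 138 (`"\xE1\xBB\x8A"`) decode to the single code point 7882 (U+1ECA, Ị), and the BOM bytes 239 187 191 decode to 65279 (U+FEFF).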

So what could you do?

One simple solution (as you are using Windows) would be to ask Notepad++ to encode the file as UTF-16LE, which is the native Unicode format on Windows. That way you would actually read the Unicode characters.

The other way would be to instruct your code to process the file as UTF-8. That would be trivial on Linux, but can be tricky on Windows, where UTF-8 is only correctly processed since VC2010. This other post from SO shows how to imbue a UTF-8 locale in a C++ stream.
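As a hedged illustration of this second option, a UTF-8 conversion facet can be imbued into the wide stream before the file is opened. This is a sketch under assumptions, not necessarily the approach from the linked post: `readUtf8File` is a hypothetical name, and it relies on `<codecvt>`, which is deprecated since C++17 and may produce warnings. On platforms where `wchar_t` is 16 bits (such as Windows), characters outside the Basic Multilingual Plane cannot be represented in a single `wchar_t` this way.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <sstream>
#include <string>

// Read a whole UTF-8 encoded file into a wide string.
// std::consume_header makes the facet skip a leading BOM if one is present.
// Returns an empty string if the file cannot be opened (or is empty).
std::wstring readUtf8File(const char* path)
{
    std::wifstream in;

    // Imbue before opening so that no bytes are read with the default locale.
    in.imbue(std::locale(in.getloc(),
        new std::codecvt_utf8<wchar_t, 0x10ffff, std::consume_header>));

    in.open(path, std::ios::binary);
    if (!in.is_open())
        return std::wstring();

    std::wstringstream buffer;
    buffer << in.rdbuf(); // copy the decoded characters in one go
    return buffer.str();
}
```

With a function like this in place of `test1`, each element of the returned string would hold one decoded character rather than one raw byte, so the numeric output for Ị would be 7882 instead of 225 187 138.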

I'm sorry for not giving code, but I have only an old VC2008 that does not support UTF8 streams... and I hate giving untested code.

Serge Ballesta
  • Hi Serge, I only now had a chance to test what you suggested. I tried the answer from the link you posted and it worked great. I get 4-digit outputs, just like I was hoping to. My output now is: `7882 7883 7884 7885 7908 7909 7748 7749`. Which is fantastic! Thanks a lot! – Rorschach Mar 09 '16 at 12:42