
In .NET I want to decode some raw data encoded by a C++ application. The C++ application is 32-bit and the C# application is 64-bit.

The C++ application supports Russian and Spanish characters, but it doesn't support Unicode. My C# binary reader fails to read Russian or Spanish characters and works only for English ASCII characters.

CArchive doesn't specify any encoding, and I am not sure how to read its output from C#.

I've tested this with a couple of simple strings; this is what the C++ CArchive produces:

For "ABC" : "03 41 42 43"

For "ÁåëÀÇ 7555Â" : "0B C1 E5 EB C0 C7 20 37 35 35 35 C2"
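As a side check (assuming the Russian text was typed on a machine whose ANSI code page was Windows-1251): the same payload bytes decode cleanly under cp1251 but come out garbled under cp1252. A quick Python sketch illustrating this:

```python
# Raw bytes CArchive wrote for the second string (length prefix 0x0B dropped).
data = bytes.fromhex("C1 E5 EB C0 C7 20 37 35 35 35 C2")

# Decoded with the Cyrillic ANSI code page (assumed to be the writer's):
print(data.decode("cp1251"))  # -> БелАЗ 7555В

# Decoded with the Western European code page, the mojibake reappears:
print(data.decode("cp1252"))  # -> ÁåëÀÇ 7555Â
```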

The following shows how the C++ application writes the binary.

void CColumnDefArray::SerializeData(CArchive& Archive)
{
    int iIndex;
    int iSize;
    int iTemp;
    CString sTemp;

    if (Archive.IsStoring())
    {
        Archive << m_iBaseDataCol;
        Archive << m_iNPValueCol;

        iSize = GetSize();
        Archive << iSize;
        for (iIndex = 0; iIndex < iSize; iIndex++)
        {
            CColumnDef& ColumnDef = ElementAt(iIndex);
            Archive << (int)ColumnDef.GetColumnType();
            Archive << ColumnDef.GetColumnId();
            sTemp = ColumnDef.GetName();
            Archive << sTemp;
        }
    }
}

And this is how I am trying to read it in C#.

The following can decode "ABC" but not the Russian characters. I've tested this.Encoding with all the available options (ASCII, UTF7, etc.). Russian characters work only with Encoding.Default. But apparently that's not a reliable option, as encoding and decoding usually happen on different PCs.

        public override string ReadString()
        {
            byte blen = ReadByte();
            if (blen < 0xff)
            {
                // *** For russian characters it comes here.***
                return this.Encoding.GetString(ReadBytes(blen));
            }

            var slen = (ushort) ReadInt16();
            if (slen == 0xfffe)
            {
                throw new NotSupportedException(ServerMessages.UnicodeStringsAreNotSupported());
            }

            if (slen < 0xffff)
            {
                return this.Encoding.GetString(ReadBytes(slen));
            }

            var ulen = (uint) ReadInt32();
            if (ulen < 0xffffffff)
            {
                var bytes = new byte[ulen];
                for (uint i = 0; i < ulen; i++)
                {
                    bytes[i] = ReadByte();
                }

                return this.Encoding.GetString(bytes);
            }

            //// Not support for 8-byte lengths 
            throw new NotSupportedException(ServerMessages.EightByteLengthStringsAreNotSupported());
        }

What is the correct approach to decode this? Do you think selecting the right code page is the way to solve this? If so, how do I know which code page was used to encode?

I'd appreciate it if someone could point me in the right direction.

Edit

I guess this question and the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" resolve some of my doubts. Apparently there is no reliable way to determine the code page of existing data.

I guess now the question is: Is there any code page that supports all Spanish, Russian and English characters? Can I specify the code page in the C++ CArchive class?

CharithJ
    Just as an aside, for (de)serialization code if you are storing in binary, you should really consider storing *only* fixed width types (e.g. [`std::int32_t`](http://en.cppreference.com/w/cpp/types/integer)). Consider if you save a file using a 32-bit application then try to load that file in a 64-bit application. The `sizeof(int)` may (and probably will) be different so you'll be parsing the binary file incorrectly. http://stackoverflow.com/questions/589575/what-does-the-c-standard-state-the-size-of-int-long-type-to-be – Cory Kramer Oct 05 '16 at 11:37
  • @CoryKramer: Actually that's the case in here. C++ application is 32 bit and C# application is 64bit. – CharithJ Oct 05 '16 at 11:41
  • What should the "ÁåëÀÇ 7555Â" decode into? "Белаз 555В"? If so, use `Encoding.GetEncoding(866)`. – Anton Gogolev Oct 05 '16 at 11:44
  • @AntonGogolev: "ÁåëÀÇ 7555Â" is the plain text and CArchive encodes it and I cannot read that encoded text in C#. I'll try with that code page... – CharithJ Oct 05 '16 at 12:00
  • I just realized my previous answer was completely useless because you already said *"C++ application ... doesn't support unicode characters"* I missed that, and when I asked you in comment, your answer wasn't clear. Anyway, it's now clear. I'll look in to see what it does. Can you show more of your c# deserialize code? – Barmak Shemirani Oct 07 '16 at 03:50
  • @BarmakShemirani: I've posted my deserialize code in the question. Look at the ReadString method in C#. We can't change the C++ code as it's coming from our legacy application. But we can change our C# code in away to support whatever the format that CArchive use. I've tried with many different Code Pages but no luck. The only way I could get the same "ÁåëÀÇ 7555Â" string back is by using the Default encoding. System.Text.Encoding.Default.GetString(bits); – CharithJ Oct 07 '16 at 03:55
  • This is what C++ CArchive provides : For "ABC" I get "03 41 42 43". For "ÁåëÀÇ 7555Â" I get "0B C1 E5 EB C0 C7 20 37 35 35 35 C2". – CharithJ Oct 07 '16 at 03:59

1 Answer


The non-Unicode C++ program writes the data as 0B C1 E5 EB C0 C7 20 37 35 35 35 C2 (the string's length, followed by its bytes).

"ÁåëÀÇ 7555Â" is how those bytes are displayed under code page 1252.

On an English-language computer, the following code returns "ÁåëÀÇ 7555Â". This works if both programs use the same code page:

string result = Encoding.Default.GetString(bytes);

You can also use code page 1252 directly. This will guarantee that the result is always "ÁåëÀÇ 7555Â" for that specific set of bytes:

//result will be `"ÁåëÀÇ 7555Â"`, always
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);



However, this may not solve the real problem. Consider an example with Greek text:

string greek = "ελληνικά";
Encoding cp1253 = Encoding.GetEncoding(1253);
var bytes = cp1253.GetBytes(greek);

bytes will be similar to the output from the C++ program. You can use the same technique to extract the text:

//result will be "åëëçíéêÜ"
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);

The result is "åëëçíéêÜ", but the desired result is "ελληνικά":

//result will be "ελληνικά"
Encoding cp1253 = Encoding.GetEncoding(1253);
string greek_decoded = cp1253.GetString(bytes);

So in order to do the correct conversion, you must know the original code page that the C++ program was using (I am just repeating Hans Passant).

You can make the following modification:

public override string ReadString()
{
    //Default code page if both programs use the same code page
    Encoding encoder = System.Text.Encoding.Default;

    //or find out what code page the C++ program is using
    //Encoding encoder = System.Text.Encoding.GetEncoding(codepage);

    //or use English code page to always get "ÁåëÀÇ 7555Â"...
    //Encoding encoder = System.Text.Encoding.GetEncoding(1252);
    //(not recommended)

    byte blen = ReadByte();
    if (blen < 0xff)
        return encoder.GetString(ReadBytes(blen));

    var slen = (ushort)ReadInt16();
    if (slen == 0xfffe)
        throw new NotSupportedException(
            ServerMessages.UnicodeStringsAreNotSupported());

    if (slen < 0xffff)
        return encoder.GetString(ReadBytes(slen));

    var ulen = (uint)ReadInt32();
    if (ulen < 0xffffffff)
    {
        var bytes = new byte[ulen];
        for (uint i = 0; i < ulen; i++)
            bytes[i] = ReadByte();
        return encoder.GetString(bytes);
    }

    throw new NotSupportedException(
        ServerMessages.EightByteLengthStringsAreNotSupported());
}

Additional comments:

The non-Unicode MFC program can take input in English or Russian, but not both languages at the same time. These old programs use char, one byte per character, so a single code page can map at most 256 characters. That is not enough room for all the alphabets of English, Russian, Greek, Arabic...

Code page 1252 maps the byte values to the Latin alphabet, while code page 1253 maps them to the Greek alphabet, and so on.

Therefore your MFC file contains text in only one language, from one code page.

Western European languages (English, Spanish, Portuguese, German, French, Italian, Swedish, etc.) use code page 1252. If users stay within this language group, then there should not be much trouble: System.Text.Encoding.Default should solve the problem, or better yet System.Text.Encoding.GetEncoding(variable_codepage).
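To illustrate (a Python sketch, not part of either program): any Western European string round-trips through code page 1252, so English and Spanish users can safely share data as long as both sides use 1252. The same bytes read under a different code page come out garbled:

```python
spanish = "Señor, ¿qué año?"       # Spanish text with accented characters
raw = spanish.encode("cp1252")     # bytes as a non-Unicode 1252 program stores them

# Same code page on both sides: the text survives intact.
assert raw.decode("cp1252") == spanish

# Read under the Cyrillic code page instead, the accents turn into mojibake:
print(raw.decode("cp1251"))
```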

There are some relevant ANSI code pages in Windows:

874 – Windows Thai
1250 – Windows Central and East European Latin 2
1251 – Windows Cyrillic
1252 – Windows West European Latin 1
1253 – Windows Greek
1254 – Windows Turkish
1255 – Windows Hebrew
1256 – Windows Arabic
1257 – Windows Baltic
1258 – Windows Vietnamese

Some Asian languages are not supported without Unicode. Some Unicode symbols are not supported in ANSI; nothing can be done about that.

It is possible to force the non-Unicode program to use more than one code page, but it is not practical. It is much easier to upgrade to Unicode and do this right.
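If the storage format could ever be extended, one compromise (a hypothetical sketch, not existing CArchive behavior; pack and unpack are made-up helpers) is to prefix each blob with the code page id that wrote it, so the reader always knows how to decode:

```python
def pack(text: str, codepage: int) -> bytes:
    """Prefix the payload with the writer's code page id (2 bytes, little-endian)."""
    return codepage.to_bytes(2, "little") + text.encode(f"cp{codepage}")

def unpack(blob: bytes) -> str:
    """Read the code page id back and decode the rest of the blob with it."""
    codepage = int.from_bytes(blob[:2], "little")
    return blob[2:].decode(f"cp{codepage}")

# Russian saved on a cp1251 machine, Spanish on a cp1252 machine:
assert unpack(pack("БелАЗ", 1251)) == "БелАЗ"
assert unpack(pack("Señor", 1252)) == "Señor"
```

Note this only labels each string with a single code page; mixing Russian and Spanish inside one string still requires Unicode.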

See also "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"

Barmak Shemirani
  • I added code that prints out the binary data in hexadecimal format. Can you replicate that? Copy/paste the first 50 or so characters so we can see if it contains Unicode or not. – Barmak Shemirani Oct 07 '16 at 00:04
  • For "ABC" I get "03 41 42 43". For "ÁåëÀÇ 7555Â" I get "0B C1 E5 EB C0 C7 20 37 35 35 35 C2". It is not Unicode. You would think it's ASCII by looking at the "ABC" encoding, but that's not true when we look at the second string, which contains non-ASCII characters. Maybe I need to select the right code page? – CharithJ Oct 07 '16 at 02:14
  • Can you fit this in your function: `string result = Encoding.Default.GetString(buf)`? to get the encoding from system's current ANSI code page, which is what MFC is doing. It should be okay as long as the two programs are on the same computer. – Barmak Shemirani Oct 07 '16 at 05:14
  • I'll try your new answer. Encoding.Default works. But we cannot guarantee that encoding and decoding happens in the same kind of PCs. – CharithJ Oct 07 '16 at 05:17
  • I am not sure how your proposed solution can help with this issue. When you look at my code, such characters are always < 255 and go into this if condition: 'byte blen = ReadByte(); if (blen < 0xff) { return this.Encoding.GetString(ReadBytes(blen)); }' – CharithJ Oct 07 '16 at 10:02
  • Show the class declaration behind `public override string ReadString(){...}` I can't reproduce this line `this.Encoding.GetString(bytes);` – Barmak Shemirani Oct 07 '16 at 17:16
  • It derives from BinaryReader. BinaryReader.ReadString is the base. – CharithJ Oct 08 '16 at 12:02
  • So, if we write the code page id in the blob, how will it work with different languages? Trying to figure out if it is possible at all. Say, for example, an English user writes something in English, then a Russian user adds something in Russian, and finally a Spanish user also edits it. Is there any code page that supports all Spanish, Russian and English characters? Unfortunately Unicode is not an option. – CharithJ Oct 09 '16 at 01:53
  • See edit for more explanation, I couldn't fit it in comment. – Barmak Shemirani Oct 09 '16 at 07:35
  • For "ÁåëÀÇ 7555Â", the byte array is "0B C1 E5 EB C0 C7 20 37 35 35 35 C2". It displays ????? for the first 5 characters when the encoding is ASCII. But when you look at the ASCII table (http://www.asciitable.com/), for C1 (193 in decimal), shouldn't it display the mapped character from the extended ASCII table? How does the .NET framework figure out that the decode fails and show '?' instead of the character from the extended ASCII table? I am just wondering how it decides that the character map/decode fails. – CharithJ Oct 10 '16 at 11:15
  • The link you are referring does not define a standard for "Extended ASCII". If you look at other web pages you may find different mapping for characters above 128. You need ANSI 1252 (default code page for Western European languages), or other ANSI code pages. – Barmak Shemirani Oct 11 '16 at 01:22