ISO6937 to UTF8 gives wrong results in C#

Question

I'm reading binary data from a file in C#, including strings that need to be decoded correctly.

I have no problems with, for example, the windows-1251 code page, but I get incorrect results for ISO6937: it looks like C# ignores two-byte characters (accent + character).

This is how I decode a string from the bytes:

Encoding.Convert(Encoding.GetEncoding("20269"), Encoding.UTF8, data)

Example:

Kraków

byte[] = 4B 72 61 6B C2 6F 77

result - Krak´ow

I did some research, but I found only some code from MediaPortal on their GitHub, which reads the two-byte characters manually. That is not the nicest way.

Am I doing something wrong, or is this a Visual Studio bug? (why they gave ability to encode to ISO6937, if this is not working incorrectly?)

This isn't a "Visual Studio" bug, and it isn't a C# bug; it *is* possible that the .NET framework and/or Windows has a glitch in that encoding. Are you 100% sure that this is the correct byte representation for that string? I'm genuinely not familiar with ISO6937. I'll be honest, though: it would be *so much easier for you* if you moved to UTF8. As for "why they gave ability to encode to ISO6937, if this is not working incorrectly?" - *if* it is bugged, it is probably because they didn't know it was faulty at the time. Bugs happen. — Marc Gravell, Apr 24 '17 at 19:39
Thanks for the answer. Unfortunately the binary data is not generated by me; I have no choice and have to read it as saved by some very old app :( The bytes are also correct: 4B 72 61 6B C2 6F 77 = K r a k [ó] w — niknejm, Apr 24 '17 at 19:42
Fortunately, this *actually* looks like a pretty easy encoding to implement - there are only 13 multi-byte markers, so you can just check the high bit (or heck, just special-case for 0xC1-0xCF) - the substitution map is also minimal - https://en.wikipedia.org/wiki/ISO/IEC_6937 - so yes, I agree that it is very vexing and disappointing that it didn't work correctly via `Encoding.GetEncoding`, but: it should be about an hours work to work around it... — Marc Gravell, Apr 24 '17 at 19:44
Using the `ByteArrayToString` method found [here](http://stackoverflow.com/questions/311165/how-do-you-convert-byte-array-to-hexadecimal-string-and-vice-versa), and running this: `Console.WriteLine(ByteArrayToString(Encoding.GetEncoding(20269).GetBytes("Kraków".ToCharArray())));`, I get these bytes which are different than yours: "4b72616b3f77". Maybe I'm missing something when converting the c# characters? Also note I am using `Encoding.GetEncoding(20269)` because `Encoding.GetEncoding("20269")` threw an exception for me. — Quantic, Apr 24 '17 at 19:47
@Quantic I think the point is : the encoding doesn't work :) checking wikipedia, 4B 72 61 6B C2 6F 77 (from the OP) is indeed the correct bytes — Marc Gravell, Apr 24 '17 at 19:51
Thanks Marc, it's exactly as you said: for some reason it doesn't work. The converted bytes are wrong (instead of treating C2 as the start of a two-byte char, it handles C2 and 6F separately), so it looks like I will have to implement this on my own, just like the MediaPortal guys did. — niknejm, Apr 24 '17 at 20:13

score 2 · Accepted Answer · answered Apr 24 '17 at 21:51

The Wikipedia page for the encoding hints at the underlying problem. Quote: "ISO/IEC 6937 does not encode any combining characters whatsoever". So formally the .NET encoding does what the standard says; in practice the result is not useful.

This can be done better than the linked GitHub code; a much cleaner approach is to write your own Encoding class. Nearly all of the work can be delegated to the .NET encoding, you only have to intercept the diacritics. ISO 6937 puts the diacritic byte before the letter, while Unicode expects the combining mark after the letter, so each diacritic has to be replaced by its combining mark and swapped with the letter. Like this:

using System.Text;

// Wraps the built-in code page 20269 (ISO 6937) encoding and fixes up the diacritic bytes.
class ISO6937Encoding : Encoding {
    private readonly Encoding enc = Encoding.GetEncoding(20269);

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        // Let the built-in decoder do the bulk of the work first.
        int cnt = enc.GetChars(bytes, byteIndex, byteCount, chars, charIndex);
        // Assumes the built-in decoder produced exactly one char per byte.
        for (int ix = 0; ix < byteCount; ix++, charIndex++) {
            int bx = byteIndex + ix;
            if (bytes[bx] >= 0xc1 && bytes[bx] <= 0xcf) {
                // A diacritic byte at the very end has no letter to attach to.
                if (ix == byteCount - 1) chars[charIndex] = '?';
                else {
                    // Combining marks for 0xC1..0xCF; '?' stands in for 0xC9 and 0xCC, which have no mapping here.
                    // 0xCB is the cedilla, U+0327.
                    const string subst = "\u0300\u0301\u0302\u0303\u0304\u0306\u0307\u0308?\u030a\u0327?\u030b\u0328\u030c";
                    // Swap: the letter goes first, the combining mark follows it.
                    chars[charIndex] = chars[charIndex + 1];
                    chars[charIndex + 1] = subst[bytes[bx] - 0xc1];
                    ++ix;
                    ++charIndex;
                }
            }
        }
        return cnt;
    }
    // Rest is boilerplate
    public override int GetByteCount(char[] chars, int index, int count) {
        return enc.GetByteCount(chars, index, count);
    }
    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        return enc.GetBytes(chars, charIndex, charCount, bytes, byteIndex);
    }
    public override int GetCharCount(byte[] bytes, int index, int count) {
        return enc.GetCharCount(bytes, index, count);
    }
    public override int GetMaxByteCount(int charCount) {
        return enc.GetMaxByteCount(charCount);
    }
    public override int GetMaxCharCount(int byteCount) {
        return enc.GetMaxCharCount(byteCount);
    }
}

Not extensively tested.
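To sanity-check the swap idea on the bytes from the question without depending on code page 20269, here is a minimal standalone sketch. `Iso6937Sketch`, `Decode` and `Marks` are hypothetical names, not part of .NET. It assumes bytes below 0x80 can be treated as ASCII and handles only the diacritic prefixes 0xC1 to 0xCF; every other high byte becomes `?`. It also composes the result, since the swap alone leaves a decomposed "o" + U+0301 rather than the single character "ó".

```csharp
using System;
using System.Text;

static class Iso6937Sketch {
    // Combining marks for bytes 0xC1..0xCF; '?' where this sketch has no mapping (0xC9, 0xCC).
    const string Marks = "\u0300\u0301\u0302\u0303\u0304\u0306\u0307\u0308?\u030a\u0327?\u030b\u0328\u030c";

    // Hypothetical helper: decodes ASCII bytes plus diacritic-prefixed ASCII letters only.
    public static string Decode(byte[] data) {
        var sb = new StringBuilder(data.Length);
        for (int i = 0; i < data.Length; i++) {
            byte b = data[i];
            if (b >= 0xC1 && b <= 0xCF && i + 1 < data.Length && data[i + 1] < 0x80) {
                sb.Append((char)data[i + 1]);  // base letter first
                sb.Append(Marks[b - 0xC1]);    // then the combining mark
                i++;                           // the letter byte has been consumed
            } else {
                // Other high bytes are out of scope for this sketch.
                sb.Append(b < 0x80 ? (char)b : '?');
            }
        }
        // Compose letter + combining mark into a single code point where one exists.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    static void Main() {
        byte[] data = { 0x4B, 0x72, 0x61, 0x6B, 0xC2, 0x6F, 0x77 };
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(data)); // Kraków
    }
}
```

The same `Normalize(NormalizationForm.FormC)` call can be applied to the string returned by `new ISO6937Encoding().GetString(data)` if precomposed characters are needed, for example before comparing against text typed on a keyboard.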

Thanks for this code, I see no issues after my tests. A much prettier and cleaner solution than hundreds of case/break. You are a genius! :D — niknejm, Apr 25 '17 at 17:55
