I was having an issue with the UTF7Encoding class truncating the '+4' sequence, and I would be very interested to know why this was happening. I tried UTF8Encoding for getting a string from the byte[] array and it seems to work just fine. Are there any known issues like that with UTF-8? Essentially, I use the output of this conversion to construct HTML out of an RTF string.

Here is the snippet:

    UTF7Encoding utf = new UTF7Encoding();
    UTF8Encoding utf8 = new UTF8Encoding();

    string test = "blah blah 9+4";

    char[] chars = test.ToCharArray();
    byte[] charBytes = new byte[chars.Length];

    for (int i = 0; i < chars.Length; i++)
    {
        charBytes[i] = (byte)chars[i];
    }

    string resultString = utf8.GetString(charBytes);
    string resultStringWrong = utf.GetString(charBytes);

    Console.WriteLine(resultString);       // blah blah 9+4
    Console.WriteLine(resultStringWrong);  // blah 9
dexter
  • Is this C#? If so you might want to tag it as such. – Laurence Gonsalves Nov 19 '10 at 19:45
  • Interesting find, definitely not expected behavior. – leppie Nov 19 '10 at 20:02
  • Actually, I think you are looking for ASCII encoding. I suspect UTF-7 is also encoded like UTF-8. – leppie Nov 19 '10 at 20:04
  • I am just reluctant to fix this with UTF8Encoding, as I am not convinced there would not be other issues. I cannot have any truncation in my data, as I deal with medical info. I really want to find out whether this is a bug and why it is happening at a deeper level. I have not started reflecting yet, but I feel that might be the next step. – dexter Nov 19 '10 at 20:06

2 Answers

1

You are not translating the string to UTF-7 bytes correctly. You should call utf.GetBytes() instead of casting each character to a byte.

I suspect that in UTF-7 the ASCII code for '+' is reserved: it introduces a base64-encoded run used for international Unicode characters, so a literal plus sign has to be written as '+-'.
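To make the point about '+' concrete, here is a minimal sketch, assuming a plain console program. The class and method names (`Utf7PlusDemo`, `EncodeToAscii`, `Decode`) are made up for illustration, and newer .NET versions flag `UTF7Encoding` as obsolete, so expect a compiler warning there.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original question.
public static class Utf7PlusDemo
{
    // Encode text as UTF-7 and view the resulting bytes as ASCII text.
    public static string EncodeToAscii(string text)
    {
        var utf7 = new UTF7Encoding();
        return Encoding.ASCII.GetString(utf7.GetBytes(text));
    }

    // Treat an ASCII string as raw UTF-7 bytes and decode it.
    public static string Decode(string asciiWire)
    {
        var utf7 = new UTF7Encoding();
        return utf7.GetString(Encoding.ASCII.GetBytes(asciiWire));
    }

    public static void Main()
    {
        // A literal '+' is escaped as "+-" when encoding.
        Console.WriteLine(EncodeToAscii("9+4")); // 9+-4

        // Decoding the escaped form gives the original text back.
        Console.WriteLine(Decode("9+-4"));       // 9+4

        // A raw '+' is read as the start of a base64 run,
        // so "+4" is not treated as literal text here.
        Console.WriteLine(Decode("9+4"));
    }
}
```

The last call mirrors what the question does by accident: the bytes were never UTF-7 encoded, so the decoder interprets '+4' as shift data rather than as the characters '+' and '4'.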

1

Converting to byte array through char array like that does not work. If you want the strings as charset-specific byte[] do this:

    UTF7Encoding utf = new UTF7Encoding();
    UTF8Encoding utf8 = new UTF8Encoding();

    string test = "blah blah 9+4";

    byte[] utfBytes = utf.GetBytes(test);
    byte[] utf8Bytes = utf8.GetBytes(test);

    string utfString = utf.GetString(utfBytes);
    string utf8String = utf8.GetString(utf8Bytes);

    Console.WriteLine(utfString);
    Console.WriteLine(utf8String);

Output:

    blah blah 9+4
    blah blah 9+4

Steve Townsend
  • OK, so why does UTF8Encoding treat the wrongly constructed byte array the way I would expect? – dexter Nov 19 '10 at 20:14
  • Sheer good luck that the `char` and `byte` representations coincide - in the general case, your `Encoding` class used to map from `string` to `byte[]` could be the one for any multibyte charset. `string` is always Unicode. – Steve Townsend Nov 19 '10 at 20:16
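
A small sketch of why the cast only works by luck, assuming a console program. The helper `CastChars` is a hypothetical name that reproduces the loop from the question, and "café" is just an arbitrary non-ASCII sample.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of the original thread.
public static class CastVersusGetBytes
{
    // The approach from the question: truncate each UTF-16 char to one byte.
    public static byte[] CastChars(string text)
    {
        var bytes = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            bytes[i] = unchecked((byte)text[i]);
        }
        return bytes;
    }

    public static void Main()
    {
        string ascii = "9+4";
        string accented = "caf\u00e9"; // "café"

        // ASCII only: the cast and UTF-8 happen to produce the same bytes.
        Console.WriteLine(BitConverter.ToString(CastChars(ascii)));              // 39-2B-34
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(ascii))); // 39-2B-34

        // Non-ASCII: the byte sequences differ.
        Console.WriteLine(BitConverter.ToString(CastChars(accented)));              // 63-61-66-E9
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(accented))); // 63-61-66-C3-A9

        // Decoding the cast bytes as UTF-8 no longer round-trips.
        Console.WriteLine(Encoding.UTF8.GetString(CastChars(accented)) == accented); // False
    }
}
```

For data that cannot tolerate loss, the safer pattern is to use the same Encoding instance for both GetBytes and GetString, as in the answer above.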