
I have a byte array: 00 01 00 00 00 12 81 00 00 01 00 C8 00 00 00 00 00 08 5C 9F 4F A5 09 45 D4 CE

It is read via a StreamReader using UTF-8 encoding:

// Note: I can't change this code; too many components depend on it.
using (StreamReader streamReader = 
    new StreamReader(responseStream, Encoding.UTF8, false))
{
    string streamData = streamReader.ReadToEnd();
    if (requestData.Callback != null)
    {
        requestData.Callback(response, streamData);
    }
}

When that function runs, I get the following back (I converted the returned string to a byte array):

00 01 00 00 00 12 EF BF BD 00 00 01 00 EF BF BD 00 00 00 00 00 08 5C EF BF BD 4F EF BF BD 09 45 EF BF BD

Somehow I need to take what's returned to me and get it back to the right encoding and the right byte array, but nothing I've tried so far has worked.

Please be aware that I'm working with the limited WP7 API.

Hopefully you guys can help.

Thanks!

Update:

If I run the following code, the result is almost right; the only thing wrong is that the 5th-to-last byte gets split out.

byte[] writeBuf1 = System.Text.Encoding.UTF8.GetBytes(data);
string buf1string = System.Text.Encoding.BigEndianUnicode.GetString(writeBuf1, 0, writeBuf1.Length);
byte[] writeBuf = System.Text.Encoding.BigEndianUnicode.GetBytes(buf1string);
John
  • Can you show us the code that is writing/creating the array? – Emond Jul 01 '11 at 04:30
  • Nope, it's coming from a third-party service; that's the exact data the service returns... Besides, I just want to get it back to what it's supposed to be (as it stands in the response stream) – John Jul 01 '11 at 04:52
  • I am seriously boggled on this one... – John Jul 01 '11 at 05:29
  • Then how do you know in what encoding and byte-order the stream is written to? – Emond Jul 01 '11 at 05:29
  • Can you attach a network sniffer (Fiddler) to see what is actually being transmitted? – Emond Jul 01 '11 at 05:31
  • Please note the array changed, but here's a screenshot of the fiddler hex http://imageshack.us/photo/my-images/818/returnz.png/ – John Jul 01 '11 at 05:39
  • http://stackoverflow.com/questions/25222973/weird-characters-in-url – trante Aug 13 '14 at 22:24

1 Answer

The original byte array is not valid UTF-8. The StreamReader therefore replaces each invalid byte with the replacement character U+FFFD. When that character gets encoded back to UTF-8, this results in the byte sequence EF BF BD. You cannot reconstruct the original byte values from the string: every invalid byte became the same character, so the information is completely lost.
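
To make the loss concrete, here is a minimal sketch (not the asker's code) that pushes two of the invalid bytes from the question, 0x81 and 0x9F, through a UTF-8 decode and re-encode. The class and method names (`ReplacementCharDemo`, `RoundTripUtf8`) are hypothetical, and the sketch assumes the default `Encoding.UTF8` behaviour of substituting U+FFFD rather than throwing.

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Hypothetical helper: decode the bytes as UTF-8, then encode the
    // resulting string back to UTF-8, as happens in the question.
    public static byte[] RoundTripUtf8(byte[] input)
    {
        string decoded = Encoding.UTF8.GetString(input, 0, input.Length);
        return Encoding.UTF8.GetBytes(decoded);
    }

    public static void Main()
    {
        // Neither byte is a valid UTF-8 sequence on its own.
        byte[] a = RoundTripUtf8(new byte[] { 0x81 });
        byte[] b = RoundTripUtf8(new byte[] { 0x9F });

        Console.WriteLine(BitConverter.ToString(a)); // EF-BF-BD
        Console.WriteLine(BitConverter.ToString(b)); // EF-BF-BD
    }
}
```

Both inputs come out as the same three bytes, so nothing in the returned string says whether the original byte was 0x81 or 0x9F.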

StackzOfZtuff
Roland Illig
  • That's what I was afraid of... So the only way to really not lose the data is to figure out what the encoding is and read it that way? Unfortunately, for some reason I can't just read a byte array; the Stream requires a StreamReader to read... – John Jul 01 '11 at 06:18
  • Yes, and when you are in doubt, use `ISO-8859-1`, so you will get a simple 1:1 mapping from bytes to characters. Just out of curiosity: why would anyone want to read a byte stream like this (which is obviously non-character data) as a character stream? – Roland Illig Jul 01 '11 at 06:24
  • Can't you ask the source of the stream for a specification? – Emond Jul 01 '11 at 06:52
  • Everything is (and has been) character data except for this one new part. Either way, I just added some overrides to optionally get the actual byte[], and all seems well with the ISO-8859-1 encoding. Thanks! – John Jul 01 '11 at 13:34
  • Wow, holy shit, so these bytes are pretty good markers of incorrect encoding being used! – mike nelson Dec 14 '15 at 18:55
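
The ISO-8859-1 suggestion from the comments can be sketched as follows, using the byte array from the question. This is only an illustration under assumptions: `Latin1RoundTrip` and `ReadAndRecover` are hypothetical names, a `MemoryStream` stands in for the real response stream, and `Encoding.GetEncoding("ISO-8859-1")` is assumed to be available on the target platform (it is on desktop .NET; the thread does not show how it was obtained on WP7).

```csharp
using System;
using System.IO;
using System.Text;

public static class Latin1RoundTrip
{
    // Hypothetical helper: read the stream as ISO-8859-1 text, then turn
    // the text back into bytes with the same encoding.
    public static byte[] ReadAndRecover(byte[] payload)
    {
        // Assumption: this encoding name is supported on the platform.
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        using (var stream = new MemoryStream(payload))
        using (var reader = new StreamReader(stream, latin1, false))
        {
            // Each byte 0x00..0xFF becomes the char U+0000..U+00FF.
            string text = reader.ReadToEnd();
            return latin1.GetBytes(text);
        }
    }

    public static void Main()
    {
        // The byte array from the question.
        byte[] original =
        {
            0x00, 0x01, 0x00, 0x00, 0x00, 0x12, 0x81, 0x00, 0x00, 0x01,
            0x00, 0xC8, 0x00, 0x00, 0x00, 0x00, 0x00, 0x08, 0x5C, 0x9F,
            0x4F, 0xA5, 0x09, 0x45, 0xD4, 0xCE
        };

        byte[] recovered = ReadAndRecover(original);

        // Compare the hex dumps of the two arrays.
        Console.WriteLine(
            BitConverter.ToString(original) == BitConverter.ToString(recovered));
    }
}
```

Because ISO-8859-1 assigns a character to every byte value, the string carries the data unchanged, which is why it round-trips where UTF-8 does not.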