
In .NET why isn't it true that:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))

returns the original byte array for an arbitrary byte array x?

It is mentioned in an answer to another question, but the responder doesn't explain why.

PyreneesJim
  • The answer you linked to talks about ASCII, not UTF-8. – svick Mar 16 '12 at 16:01
  • 1
  • Can you even compare byte arrays using `==`? That probably just compares their references; you'll probably have to loop over the arrays and compare each element for equality. – Matthew Mar 16 '12 at 16:06
  • @Matthew the gist of [that answer](http://stackoverflow.com/a/3946274/85371) seems to be that the encoding may vary. And yes the example code is flawed/backwards. – sehe Mar 16 '12 at 16:14
  • The explanation is simple: Not every arbitrary sequence of bytes is a valid UTF-8 encoding. Interpreting something as UTF-8 that isn't will produce unexpected results. Converting a UTF-8 encoded string back to a byte buffer will thus not necessarily produce the original sequence. The solution really is to use an encoding that can encode an arbitrary byte sequence (like Base64). Everything said about UTF-8 in this comment is true for ASCII as well (which the linked question is using), and the core issue is the same. – IInspectable Sep 02 '16 at 01:34

3 Answers


First, as watbywbarif mentioned, you shouldn't compare arrays using ==; that compares references, not their contents.

But even if you compare the arrays correctly (e.g. by using SequenceEqual() or just by looking at them), they aren't always the same. One case where this happens is when x is not a valid UTF-8 byte sequence.

For example, the 1-byte sequence of 0xFF is not valid UTF-8. So what does Encoding.UTF8.GetString(new byte[] { 0xFF }) return? It's �, U+FFFD, REPLACEMENT CHARACTER. And of course, if you call Encoding.UTF8.GetBytes() on that, it doesn't give you back 0xFF.
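
A minimal sketch of that round-trip (the byte values and names here are just for illustration; U+FFFD itself encodes to the three bytes EF BF BD in UTF-8, so the original 0xFF is gone):

using System;
using System.Linq;
using System.Text;

class ReplacementDemo
{
    static void Main()
    {
        byte[] x = { 0xFF };                                     // not valid UTF-8 on its own

        string s = Encoding.UTF8.GetString(x);                   // decodes to "\uFFFD" (REPLACEMENT CHARACTER)
        byte[] roundTripped = Encoding.UTF8.GetBytes(s);

        Console.WriteLine(BitConverter.ToString(roundTripped));  // EF-BF-BD, the UTF-8 encoding of U+FFFD
        Console.WriteLine(x.SequenceEqual(roundTripped));        // False
    }
}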

svick

Another angle to come at this from is that the Encoding classes are designed to round-trip data, but the data they're designed to round-trip is char data, encoded to byte, not the other way around. What this means is that, within the capabilities of the Encoding in question, each char value has a corresponding encoding in byte values (1 or more) that will turn back into exactly the same char value. (It is worth noting that not all Encodings can do this for all possible char values -- for instance, Encoding.ASCII can only support char values in the range [0, 128).)

So, if you're starting with character data and you need a way to store or send it in a medium that works with bytes (such as a file on disk or a network stream), Encoding is an excellent way to convert the char data to byte data and then back again on the other end. (If you want to support all possible strings, you'll need to use one of the Unicode-based Encodings, such as Encoding.Unicode or Encoding.UTF8.)
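
For instance, a minimal sketch of the direction the Encoding classes are designed for, starting from char data (the string and class name are just illustrative):

using System;
using System.Text;

class CharRoundTripDemo
{
    static void Main()
    {
        string original = "héllo, wörld";                     // char data, including non-ASCII

        byte[] encoded = Encoding.UTF8.GetBytes(original);    // chars -> bytes
        string decoded = Encoding.UTF8.GetString(encoded);    // bytes -> chars

        Console.WriteLine(decoded == original);               // True: this direction round-trips
    }
}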

So, what does this mean if you're starting with a bunch of bytes? Well, depending on the encoding in question, the bytes you're working with might not actually be a sequence that the Encoding would ever have output. Look at Encoding.GetBytes as an encoding operation and Encoding.GetChars/Encoding.GetString as a decoding operation: starting from an arbitrary array of bytes, you're trying to decode data that may never have come out of an encoder in the first place.

For an analogy, consider the JPEG file format for images. This has a similar type of encoding and decoding, where in this case the decoded data isn't a string but an image. So, if you take an arbitrary string of bytes, what are the chances that it could be decoded as a JPEG image? The answer to that, obviously, is very very slim. More likely, your bytes will end up going down a path in the decoder that says, "Woah there, I wasn't expecting that byte to come after that other one", and it will do its best to handle the data on the assumption that it is a valid JPEG file that got damaged somehow.

Exactly the same thing happens when you convert an arbitrary array of bytes to a string. The UTF-8 encoding has specific rules about how char values 128 and up get encoded, and one of those rules says that you will only ever see a byte matching the bit pattern 10xxxxxx inside a multi-byte sequence (multiple bytes representing a single char) introduced by a byte matching 110xxxxx, 1110xxxx or 11110xxx. So, if your data contains a byte matching the pattern 10xxxxxx that isn't part of such a sequence, the decoder can only assume that the data got damaged somehow. What does it do? It inserts a character that says, "Something went horribly wrong with the encoded data. I tried my best. This is where it went wrong." The people who designed Unicode anticipated this exact scenario and created a character with this precise meaning: the Replacement Character.
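
A minimal sketch of that rule in action, assuming the default decoder behaviour (0x80 matches 10xxxxxx but doesn't follow an introducer, so it gets replaced):

using System;
using System.Text;

class LoneContinuationByteDemo
{
    static void Main()
    {
        // 0x41 is 'A', 0x42 is 'B'; the 0x80 in the middle is a stray continuation byte
        byte[] damaged = { 0x41, 0x80, 0x42 };

        string decoded = Encoding.UTF8.GetString(damaged);
        Console.WriteLine(decoded);                           // "A�B" -- the stray byte became U+FFFD
    }
}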

So, if you're trying to round-trip your bytes in a string of chars and this scenario is encountered, the actual value of the offending byte gets lost, and instead a Replacement Character is inserted. When you try to turn the string back into a byte array, it ends up encoding the Replacement Character, not the original data. The original data is lost.

What you're looking for is an encoding & decoding relationship that works in the other direction. Encoding is for taking char data and finding a way to temporarily store it as byte data. If you want to take byte data and find a way to temporarily store it as char data, you need an encoding designed for that specific purpose. Fortunately, these exist. Wikipedia has a fairly comprehensive list of the options. :-)

Within the .NET Framework, the simplest and most accessible option is MIME Base-64 encoding, which is exposed via Convert.ToBase64String and Convert.FromBase64String.
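
For example, a minimal sketch of round-tripping arbitrary bytes through a string with Base-64 (the byte values here are just illustrative; this direction is lossless):

using System;
using System.Linq;

class Base64Demo
{
    static void Main()
    {
        byte[] original = { 0xFF, 0x00, 0x80, 0xC3 };         // arbitrary bytes, not valid UTF-8

        string text = Convert.ToBase64String(original);       // bytes -> chars
        byte[] restored = Convert.FromBase64String(text);     // chars -> bytes

        Console.WriteLine(text);                              // "/wCAww=="
        Console.WriteLine(original.SequenceEqual(restored));  // True
    }
}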

Jonathan Gilbert

This is because == does not compare the elements of the arrays; it compares the references. It has no connection with Encoding.UTF8. Check this:

var a = new byte[] { 1 };
var b = new byte[] { 1 };
bool res = a == b;   // false: == compares references, not the arrays' contents
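
To actually compare the contents, you'd use something like Enumerable.SequenceEqual; a minimal sketch:

using System;
using System.Linq;

class CompareDemo
{
    static void Main()
    {
        var a = new byte[] { 1 };
        var b = new byte[] { 1 };

        Console.WriteLine(a == b);              // False: compares references
        Console.WriteLine(a.SequenceEqual(b));  // True: compares elements
    }
}
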
watbywbarif