In .NET why isn't it true that:
Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(x))
returns the original byte array for an arbitrary byte array x?
This is mentioned in an answer to another question, but the responder doesn't explain why.
First, as watbywbarif mentioned, you shouldn't compare sequences by using ==; that doesn't work.
But even if you compare the arrays correctly (e.g. by using SequenceEqual() or just by looking at them), they aren't always the same. One case where this can occur is if x is not a valid UTF-8 encoded string.
For example, the 1-byte sequence 0xFF is not valid UTF-8. So what does Encoding.UTF8.GetString(new byte[] { 0xFF }) return? It's �, U+FFFD, REPLACEMENT CHARACTER. And of course, if you call Encoding.UTF8.GetBytes() on that, it doesn't give you back 0xFF; it gives you the three bytes 0xEF 0xBF 0xBD, which is how U+FFFD is encoded in UTF-8.
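Here's a minimal sketch of that round trip if you want to see it for yourself. It only uses standard System.Text calls; the variable names are just for illustration.

```csharp
using System;
using System.Linq;
using System.Text;

// 0xFF can never appear in well-formed UTF-8.
byte[] original = { 0xFF };

// Decoding substitutes U+FFFD for the invalid byte.
string decoded = Encoding.UTF8.GetString(original);

// Re-encoding produces the UTF-8 form of U+FFFD, not the original byte.
byte[] roundTripped = Encoding.UTF8.GetBytes(decoded);

Console.WriteLine(decoded == "\uFFFD");                   // True
Console.WriteLine(BitConverter.ToString(roundTripped));   // EF-BF-BD
Console.WriteLine(original.SequenceEqual(roundTripped));  // False
```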
Another angle to come at this from is that the Encoding classes are designed to round-trip data, but the data they're designed to round-trip is char data, encoded to byte, not the other way around. What this means is that, within the capabilities of the Encoding in question, each char value has a corresponding encoding in byte values (1 or more) that will turn back into exactly the same char value. (It is worth noting that not all Encodings can do this for all possible char values; for instance, Encoding.ASCII can only support char values in the range [0, 128).)
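To illustrate the ASCII limitation, here is a small sketch. It assumes the default Encoding.ASCII instance, which substitutes '?' for anything it cannot represent; the sample string is arbitrary.

```csharp
using System;
using System.Text;

// "café": U+00E9 is outside the [0, 128) range that ASCII can represent.
string text = "caf\u00E9";

byte[] asciiBytes = Encoding.ASCII.GetBytes(text);
string asciiResult = Encoding.ASCII.GetString(asciiBytes);

Console.WriteLine(asciiResult);          // caf?
Console.WriteLine(asciiResult == text);  // False
```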
So, if you're starting with character data and you need a way to store or send it in a medium that works with bytes (such as a file on disk or a network stream), Encoding is an excellent way to convert the char data to byte data and then back again on the other end. (If you want to support all possible strings, you'll need to use one of the Unicode-based Encodings, such as Encoding.Unicode or Encoding.UTF8.)
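This is the direction that does round-trip. A quick sketch, with an arbitrary sample string containing non-ASCII characters:

```csharp
using System;
using System.Text;

// Arbitrary well-formed text: "naïve €5 日本語".
string message = "na\u00EFve \u20AC5 \u65E5\u672C\u8A9E";

// char -> byte -> char with UTF-8.
byte[] utf8Bytes = Encoding.UTF8.GetBytes(message);
string fromUtf8 = Encoding.UTF8.GetString(utf8Bytes);

// char -> byte -> char with UTF-16 (Encoding.Unicode).
byte[] utf16Bytes = Encoding.Unicode.GetBytes(message);
string fromUtf16 = Encoding.Unicode.GetString(utf16Bytes);

Console.WriteLine(fromUtf8 == message);   // True
Console.WriteLine(fromUtf16 == message);  // True
```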
So, what does this mean if you're starting with a bunch of bytes? Well, depending on the encoding in question, the bytes you're working with might not actually be a sequence that Encoding would ever have output. You need to look at Encoding.GetBytes as an encoding operation, and Encoding.GetChars/Encoding.GetString as a decoding operation, so what you're really doing is taking an arbitrary array of bytes and trying to decode it.
For an analogy, consider the JPEG file format for images. This has a similar type of encoding and decoding, where in this case the decoded data isn't a string but an image. So, if you take an arbitrary string of bytes, what are the chances that it could be decoded as a JPEG image? The answer to that, obviously, is very slim. More likely, your bytes will end up going down a path in the decoder that says, "Whoa there, I wasn't expecting that byte to come after that other one", and it will do its best to handle the data on the assumption that it is a valid JPEG file that got damaged somehow.
Exactly the same thing happens when you convert an arbitrary array of bytes to a string. The UTF-8 encoding has specific rules about how char values 128 and up get encoded, and one of those rules says that you will only ever see a byte matching the bit pattern 10xxxxxx after one that matches a pattern like 110xxxxx, 1110xxxx or 11110xxx, which "introduces" a multi-byte sequence (multiple bytes representing a single character). So, if your data contains a byte matching the pattern 10xxxxxx that doesn't follow one of the expected "introducers", the decoder can only assume that the data got damaged somehow. What does it do? It inserts a character that says, "Something went horribly wrong with the encoded data. I tried my best. This is where it went wrong." The people who designed Unicode anticipated this exact scenario and created a character with this precise meaning: the Replacement Character.
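A sketch of the stray-continuation-byte case: 0x80 matches 10xxxxxx but has no introducer in front of it. The second half is an optional extra, not something the default Encoding.UTF8 does: constructing UTF8Encoding with throwOnInvalidBytes set to true makes the decoder throw instead of substituting.

```csharp
using System;
using System.Text;

// 'A', then a lone continuation byte (10000000), then 'B'.
byte[] damaged = { 0x41, 0x80, 0x42 };

// The default UTF-8 decoder marks the bad spot with U+FFFD.
string lenient = Encoding.UTF8.GetString(damaged);
Console.WriteLine(lenient == "A\uFFFDB");  // True

// A strict instance rejects the input rather than replacing it.
var strictUtf8 = new UTF8Encoding(
    encoderShouldEmitUTF8Identifier: false,
    throwOnInvalidBytes: true);

bool rejected = false;
try
{
    strictUtf8.GetString(damaged);
}
catch (DecoderFallbackException)
{
    rejected = true;
}
Console.WriteLine(rejected);  // True
```

The strict variant is handy when you would rather find out that the bytes were not UTF-8 than silently carry replacement characters around.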
So, if you're trying to round-trip your bytes in a string of chars and this scenario is encountered, the actual value of the offending byte gets lost, and instead a Replacement Character is inserted. When you try to turn the string back into a byte array, it ends up encoding the Replacement Character, not the original data. The original data is lost.
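One way to see that the information really is gone: two different invalid inputs decode to the same string, so nothing in the string can tell you which bytes you started with. A small sketch (the two inputs are arbitrary invalid bytes):

```csharp
using System;
using System.Linq;
using System.Text;

// Two different byte arrays, neither of which is valid UTF-8.
byte[] first = { 0x80 };
byte[] second = { 0xFF };

string firstDecoded = Encoding.UTF8.GetString(first);
string secondDecoded = Encoding.UTF8.GetString(second);

// Both collapse to a single U+FFFD, so the strings are indistinguishable.
Console.WriteLine(firstDecoded == secondDecoded);  // True
Console.WriteLine(first.SequenceEqual(second));    // False
```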
What you're looking for is an encoding & decoding relationship that works in the other direction. Encoding is for taking char data and finding a way to temporarily store it as byte data. If you want to take byte data and find a way to temporarily store it as char data, you need an encoding designed for that specific purpose. Fortunately, these exist. Wikipedia has a fairly comprehensive list of the options. :-)
Within the .NET Framework, the simplest and most accessible option is MIME Base64 encoding, which is exposed via Convert.ToBase64String and Convert.FromBase64String.
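A minimal sketch of the Base64 route, using bytes that would not survive a trip through Encoding.UTF8 (the sample bytes are arbitrary):

```csharp
using System;
using System.Linq;

// Arbitrary binary data, including bytes that are invalid in UTF-8.
byte[] data = { 0xFF, 0x80, 0x00 };

// byte -> string -> byte, designed to be lossless in this direction.
string text = Convert.ToBase64String(data);
byte[] restored = Convert.FromBase64String(text);

Console.WriteLine(text);                          // /4AA
Console.WriteLine(data.SequenceEqual(restored));  // True
```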
This is because == does not compare the elements of the arrays; for arrays it only checks whether both variables refer to the same object. It has no connection with Encoding.UTF8. Check this:
var a = new byte[] { 1 };
var b = new byte[] { 1 };
bool res = a == b;
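For comparison, here is the same pair of arrays checked both ways. SequenceEqual is the LINQ extension method from System.Linq; the variable names other than a and b are mine.

```csharp
using System;
using System.Linq;

var a = new byte[] { 1 };
var b = new byte[] { 1 };

// Reference comparison: a and b are two separate array objects.
bool sameReference = a == b;

// Element-by-element comparison of the contents.
bool sameContents = a.SequenceEqual(b);

Console.WriteLine(sameReference);  // False
Console.WriteLine(sameContents);   // True
```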