
Short version

Is this an identity function?

f = (gₐ · hᵤ · gᵤ · hₐ)

where:

  • hₐ is the UTF-16 conversion from bytes to string,
  • gₐ is the UTF-16 conversion from string to bytes,
  • gᵤ is Encoding.UTF8.GetBytes(),
  • hᵤ is Encoding.UTF8.GetString().
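
To make the composition concrete, here is a rough sketch of f in C#. The UTF-16 steps use the Buffer.BlockCopy reinterpretation from Mehrdad's answer, the UTF-8 steps stand in for what WebSocket4Net does internally, and the class and helper names are mine, purely for illustration:

using System;
using System.Text;

static class RoundTripSketch
{
    // hₐ: bytes -> string, reinterpreting the raw bytes as UTF-16 code units
    // (the approach from Mehrdad's answer, not an encoding conversion).
    static string BytesToString(byte[] bytes)
    {
        char[] chars = new char[bytes.Length / sizeof(char)];
        Buffer.BlockCopy(bytes, 0, chars, 0, bytes.Length);
        return new string(chars);
    }

    // gₐ: string -> bytes, the inverse raw copy.
    static byte[] StringToBytes(string str)
    {
        byte[] bytes = new byte[str.Length * sizeof(char)];
        Buffer.BlockCopy(str.ToCharArray(), 0, bytes, 0, bytes.Length);
        return bytes;
    }

    static void Main()
    {
        byte[] original = { 200, 8, 0, 216 };         // an arbitrary (even-length) binary payload
        string s1 = BytesToString(original);          // hₐ
        byte[] wire = Encoding.UTF8.GetBytes(s1);     // gᵤ (what the library does when sending)
        string s2 = Encoding.UTF8.GetString(wire);    // hᵤ (what the library does when receiving)
        byte[] result = StringToBytes(s2);            // gₐ

        Console.WriteLine(BitConverter.ToString(original)); // C8-08-00-D8
        Console.WriteLine(BitConverter.ToString(result));   // is this the same sequence?
    }
}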

Long version

I'm using WebSocket4Net to send and receive messages through WebSockets between a C# application and a C# service.

Since some messages are binary, I have to convert them to and from strings when interacting with the library: while its Send() method can send an array of bytes, its MessageReceived event delivers the received message only as a string.

To convert bytes to a string and a string back to bytes, I follow the answer by Mehrdad, which relies on the internal encoding of the .NET Framework, i.e. UTF-16.

On the other hand, according to the source code (see for example DraftHybi10Processor.cs, line 114), WebSocket4Net uses UTF-8 to convert strings to bytes and bytes to strings.

Would it cause issues? Is data loss possible?

Arseni Mourzenko
  • How can a string to byte conversion and vice versa ever be encoding agnostic? – Luaan Mar 20 '14 at 15:31
  • @Luaan: the answer by Mehrdad (see the link in my question) explains that, and why using encoding makes no sense. – Arseni Mourzenko Mar 20 '14 at 15:34
  • The answer by Mehrdad is quite flawed. He's still using an encoding, he just uses UTF-16 encoding without realizing it (and killing portability, thanks to endianness issues). I don't see how that's better than using an explicit encoding. Also, encoding a unicode-to-bytes array using UTF-8 is a huge waste of space :) – Luaan Mar 20 '14 at 15:35
  • @Luaan is dead on... You cannot convert from string to bytes without an encoding. It's just not possible. – Kevin Mar 20 '14 at 15:37
  • @Kevin it is possible, and Mehrdad's answer does it, even if it doesn't explain it well enough. Regardless of the string's internal encoding, it is represented as a sequence of bytes. By simply taking that sequence of bytes without in any way transforming the data or performing a conversion, you have converted the string to a sequence of bytes. Yes, it is possible to do that. The resulting byte sequence obviously depends on the string's internal encoding, but it can be done, and it can be done regardless of the encoding used internally by the string. – jalf Mar 20 '14 at 15:40
  • @MainMa: That answer is a very dangerous way of doing *one thing exactly*. As a testament to the danger, you have not understood at all what he's doing there. As Luaan says, "encoding-agnostic conversion" does not make sense. It's not a *conversion*, it's a *reinterpretation*. – Jon Mar 20 '14 at 15:40
  • Whether or not it is *useful* is a different question, of course. :) As to the OP, why would you make life so difficult for yourself? Given the string, convert it to UTF-8 and send those bytes. And when reading bytes at the other end, create a string from them using the UTF-8 encoding. – jalf Mar 20 '14 at 15:41
  • @Luaan: I know that. Should I reformulate my question in terms of UTF-8 encoding coupled with UTF-16 encoding? I thought the current formulation was easier to understand; it seems that it's not. – Arseni Mourzenko Mar 20 '14 at 15:42
  • @jalf Not explicitly using an encoding from a string is still using an encoding. With the bonus of being completely screwed and not knowing why when you use those bytes elsewhere. – Kevin Mar 20 '14 at 15:42
  • @Kevin yes, you will be completely screwed if you send those bytes to code that makes different assumptions about the encoding. And yes, the bytes you get from the string have an encoding, but your "conversion" doesn't *care* about the encoding. The encoding is irrelevant, and the "conversion" would still work even if the string contained garbage data. But it is a dumb thing to do and certainly not the right way to serialize a string. I'm just saying that "yes, you can most certainly get the raw bytes out of a string without caring or knowing which encoding the string uses" – jalf Mar 20 '14 at 15:45
  • Incidentally, that is exactly what Mehrdad's answer describes: how to get the bytes of a string object into an array. Encoding is not relevant for that operation. The encoding is only relevant if you want to preserve the *meaning* of the bytes. If you just want the dumb byte sequence without caring about their semantics, then the encoding does not come into play. – jalf Mar 20 '14 at 15:47
  • Regardless, this is a silly discussion of semantics. We all agree that the only way to get a *meaningful* byte sequence from a string is to serialize it to a specific, known, encoding. The nonsensical operations you can do to get *some* byte sequence without knowing its meaning (its encoding) don't really matter, whether or not we want to describe them as "implicitly using an encoding" or not. – jalf Mar 20 '14 at 15:51
  • @Luaan: I changed the terms used in my question to make UTF-16 more explicit. I hope this makes things clearer and shifts the attention from the validity of Mehrdad's answer to the question of mixing encodings. – Arseni Mourzenko Mar 20 '14 at 15:51
  • @jalf I agree. However, the OP wants to translate from string -> bytes -> string. – Kevin Mar 20 '14 at 15:51
  • Mehrdad's statement "I don't understand why..." should give you pause when combined with the fact that character encoding has so many standards and RFCs to define it. So many others worry about it that it probably does have its reasons (and it does). As many said, his version is simply an unsafe interpretation of .NET's internal string representation (which happens to be UTF-16). True encoding conversion is not hard and should be done right to avoid future problems. Either pick an encoding as the standard for your protocol, or have flexibility in your protocol to handle various encodings. – LB2 Mar 20 '14 at 15:56
  • @LB2 did you read the question he answered? That question was simply about getting the raw bytes of the string's internal representation, and he is right, for that purpose, it is incomprehensible that people kept going on about encodings. Of course, if you want a *useful* byte sequence, you should most certainly keep encodings in mind. But that wasn't the question he answered. The question *was* about the unsafe interpretation of .NET's internal string representation. Anyway, I don't see how that is relevant *here*, to *this* question. – jalf Mar 20 '14 at 23:01

1 Answer


If you need to send binary data as a string, well, that's what Base-64 and similar encodings are for. If you need to send a string as a string... well, send it as a string. If you need to send a string as bytes, Unicode (UTF-16) or UTF-8 will do just fine. Strings aren't simple byte arrays (even if they can be represented that way if necessary). Unicode especially is quite a complicated encoding (see http://www.joelonsoftware.com/articles/Unicode.html; read it - it's a must). Did you know that you can get a Unicode normalization that splits a single character into 5 bytes? The same character could also be interpreted as 2. Or a completely different number. I haven't observed it, but I'd expect that some byte arrays will be outright invalid in UTF-16 (which is the current default string encoding in .NET).
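
As a sketch of the kind of thing I mean (assuming Encoding.Unicode, .NET's UTF-16 codec, with its default replacement fallback; I haven't verified this on every runtime):

using System.Text;

// 0x00 0xD8 is a lone high surrogate (U+D800) in little-endian UTF-16,
// i.e. a code unit that is not valid text on its own.
byte[] input = { 0x00, 0xD8 };
string decoded = Encoding.Unicode.GetString(input);
byte[] roundTripped = Encoding.Unicode.GetBytes(decoded);
// Under the default fallback the lone surrogate gets substituted rather than
// preserved, so roundTripped is not expected to equal { 0x00, 0xD8 } anymore.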

I'm not going to go through a proof that your "double encoding" is flawed; I'm not sure, it might even work. However, the string you're going to get is going to be pretty odd, and you'll have a lot of trouble escaping it to make sure you're not accidentally sending commands or something.

The more important thing is that you're not showing intent. You're doing micro-optimizations and sacrificing readability. Worse, you're relying on implementation details, which aren't necessarily portable or stable across later versions of .NET, not to mention other environments.

Unless you have a very, very good reason (based on actual performance analysis, not a "gut feeling"), go with the simple, readable solution. You can always improve if you have to.

EDIT: Sample code to show why treating arbitrary (non-text) bytes as text in a Unicode encoding (here UTF-8) is a bad idea:

Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(new byte[] { 200, 8 }))

The two input bytes turned into four bytes, { 239, 191, 189, 8 }: byte 200 is not valid UTF-8 on its own, so it decodes to the replacement character U+FFFD, which re-encodes as the three bytes 239, 191, 189. Not quite what you wanted.
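
For comparison, a minimal sketch of the Base-64 route mentioned above; it is designed exactly for pushing arbitrary bytes through a string-only channel and round-trips them bit for bit:

byte[] payload = { 200, 8 };
string text = Convert.ToBase64String(payload);     // "yAg=" - plain ASCII, safe to hand to any string-only API
byte[] restored = Convert.FromBase64String(text);  // { 200, 8 } again, unchanged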

Luaan
  • Base-64 is indeed a way to do it in general, but not in my case. In my case, WebSockets communication is often done through a low-speed network connection, and having a 4:3 ratio would be annoying. – Arseni Mourzenko Mar 20 '14 at 15:46
  • Yes, some byte arrays are illegal UTF-8 or UTF-16. – Kevin Mar 20 '14 at 15:50
  • Well, until web sockets allow sending binary data directly (which is a feature that's being implemented - WebSocket4Net does support it already), that's the only reliable and simple way. If you really want to send the chars, ignore UTF-8 and UTF-16 - those are complicated. Instead, encode it in some ASCII encoding - those aren't languages, just primitive char-byte tables. That conversion is guaranteed to be 1:1 (every character is unique and has a single-byte representation). The best is actually 7-bit ASCII (ie. not extended), if you handle the "dangling" bit - that guarantees perfect UTF-8. – Luaan Mar 20 '14 at 15:55
  • @MainMa An important point to note as well is that UTF-8 characters which can be represented as a shorter "byte array", should be. So if you have a 2-byte UTF-8 "character", that can be converted into a 1-byte UTF-8 - your 2 input bytes are now mangled into 1. Not a problem with a string (the string is still the same), but it's no longer the same byte array. – Luaan Mar 20 '14 at 16:07
  • @MainMa And to add further to your size argument, the data you're sending is being encoded for the transfer. UTF-8 handles Base-64 just great (they're all characters that fit into just one byte). But encoding UTF-16 characters created from what might as well be random binary data? That's going to hurt. I'm quite sure that if you actually tried to profile your approach, you would end up with bigger data on average with your method than with base-64. – Luaan Mar 20 '14 at 16:13
  • @Luaan which implementations do not support binary websockets? AFAIK it's a pretty safe assumption that that is supported. – jalf Mar 20 '14 at 23:04
  • @jalf It might be that you're right. Binary support was added to the spec almost three years ago. I haven't tried it myself, though :) – Luaan Mar 21 '14 at 08:56
  • @Luaan we use it fairly heavily, but only with browser clients, and only connecting to our home-written Websocket server. As far as browsers go, almost all of them went straight from "Websockets not implemented or disabled by default" to "websockets work fine, in binary as well as UTF-8 mode". I think Safari 5.1 might have been a problem, but nothing more recent or widely used than that. (But I can't speak for non-browser implementations) – jalf Mar 21 '14 at 12:37