C# big-endian UCS-2

Question

The project I'm currently working on needs to interface with a client system that we don't make, so we have no control over how data is sent in either direction. The problem is that we're working in C#, which doesn't seem to have any support for UCS-2 and has very little support for big-endian (as far as I can tell).

What I would like to know is whether there's anything I overlooked in .NET, or something that someone else has made and released that we can use. If not, I will take a crack at encoding/decoding it in a custom method, if that's even possible.

But thanks for your time either way.

EDIT: BigEndianUnicode does correctly decode the string. The problem was in receiving other data as big-endian. So far, using IPAddress.HostToNetworkOrder() as suggested elsewhere has allowed me to decode half of the string (Merli? is what comes up, and it should be Merlin33069).

I'm combing the short code to see if there's another length variable I missed.

RESOLUTION: After working out that the big-endian variables were the main problem, I went back through and reviewed the details. It seems that the length of the strings was sent as a character count, not a byte count (in UTF-16, a char is two bytes). All I needed to do was double it, and it worked. Thank you all for your help.
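
For anyone hitting the same thing, here is a minimal sketch of the fix described in the resolution. It assumes the length prefix is a 2-byte big-endian count of characters (the real protocol may use a different prefix width), and the names `LengthPrefixedString`, `Read` and `ReadExactly` are made up for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class LengthPrefixedString
{
    // Assumption: the prefix is a 2-byte big-endian count of UTF-16 code units.
    public static string Read(Stream input)
    {
        byte[] prefix = ReadExactly(input, 2);
        int charCount = (prefix[0] << 8) | prefix[1];

        // Two bytes per UTF-16 code unit, so the byte length is double the count.
        byte[] payload = ReadExactly(input, charCount * 2);
        return Encoding.BigEndianUnicode.GetString(payload);
    }

    // Stream.Read may return fewer bytes than requested, so loop until done.
    private static byte[] ReadExactly(Stream input, int count)
    {
        byte[] buffer = new byte[count];
        int offset = 0;
        while (offset < count)
        {
            int read = input.Read(buffer, offset, count - offset);
            if (read == 0) throw new EndOfStreamException();
            offset += read;
        }
        return buffer;
    }

    static void Main()
    {
        // 0x00 0x02 = two characters, followed by "Hi" in big-endian UTF-16.
        byte[] wire = { 0x00, 0x02, 0x00, 0x48, 0x00, 0x69 };
        Console.WriteLine(Read(new MemoryStream(wire))); // Hi
    }
}
```

Reading only `charCount` bytes instead of `charCount * 2` would decode just the first half of the string, which matches the truncated result seen in the edit above.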

In most (not all) cases, UCS-2 is the same as UTF-16; are you just looking for `Encoding.BigEndianUnicode` ? Note that this is really .NET not C# — Marc Gravell, Aug 07 '11 at 08:27
I strongly suspect the problem *isn't* a difference between UCS-2 and UTF-16. Please give some sample data demonstrating the problem - show the raw bytes, and what you'd *expect* the decoded text to be. — Jon Skeet, Aug 07 '11 at 08:33
Well, I found the issue: the client is in Java and our side is in C#, so when they send the string *length* it's also big-endian, and when we read the length in C# it comes out different. — RyanTimmons91, Aug 07 '11 at 08:38
So the issue now is figuring out how to convert for sending/receiving. EDIT: I think I can just reverse the bytes, right? — RyanTimmons91, Aug 07 '11 at 08:39
@Merlin rather than *reversing* them (which could be incorrect on some systems) - I would simply read them and use "shift" operations... will add as an answer — Marc Gravell, Aug 07 '11 at 08:41
@Merlin33069: I suggest you edit your question to make this clearer - I'll edit my answer with some options. — Jon Skeet, Aug 07 '11 at 08:43

Ivan Danilov · Answer 1 · 2011-08-07T08:42:43.860

string x = "abc";
byte[] data = Encoding.BigEndianUnicode.GetBytes(x);

In the other direction:

string decodedX = Encoding.BigEndianUnicode.GetString(data);

It is not exactly UCS-2 but it is enough for most cases.
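
To see what actually goes over the wire, a small sketch like the following can help. The class and method names (`BigEndianRoundTrip`, `ToHex`, `FromBytes`) are only illustrative; the encoding calls are the same ones used above.

```csharp
using System;
using System.Text;

static class BigEndianRoundTrip
{
    // Encodes text as big-endian UTF-16 and formats the bytes as hex.
    public static string ToHex(string text)
    {
        byte[] data = Encoding.BigEndianUnicode.GetBytes(text);
        return BitConverter.ToString(data);
    }

    // Decodes big-endian UTF-16 bytes back into a string.
    public static string FromBytes(byte[] data)
    {
        return Encoding.BigEndianUnicode.GetString(data);
    }

    static void Main()
    {
        // The high byte of each 16-bit code unit comes first.
        Console.WriteLine(ToHex("abc")); // 00-61-00-62-00-63
        Console.WriteLine(FromBytes(new byte[] { 0x00, 0x61, 0x00, 0x62, 0x00, 0x63 })); // abc
    }
}
```

Note that `GetBytes` does not prepend a byte order mark, so the output is just two bytes per character for text like this.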

UPD: Unicode FAQ

Q: What is the difference between UCS-2 and UTF-16?

A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.

UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.

Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters.

It would be better to explain the difference between UCS-2 and UTF-16 ... UTF-16 Unicode extension A and B. UCS-2 supports only the Basic Multilingual Plane (BMP). — J-16 SDiZ, Aug 07 '11 at 08:34

Jon Skeet · Accepted Answer · 2011-08-07T08:46:02.687

EDIT: Now we know that the problem isn't in the encoding of the text data but in the encoding of the length. There are a few options:

Reverse the bytes and then use the built-in BitConverter code (which I assume is what you're using now; that or BinaryReader)
Perform the conversion yourself using repeated "add and shift" operations
Use my EndianBitConverter or EndianBinaryReader classes from MiscUtil, which are like BitConverter and BinaryReader, but let you specify the endianness.
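
As a rough sketch of the first option (not the MiscUtil classes), reversing the bytes before calling BitConverter might look like this. It assumes a 4-byte length, and the names `BigEndianInt32`, `ViaReverse` and `ViaNetworkOrder` are hypothetical. The second method uses `IPAddress.NetworkToHostOrder`, the counterpart of the call mentioned in the question.

```csharp
using System;
using System.Net;

static class BigEndianInt32
{
    // Reverse a copy only when the host is little-endian, then let
    // BitConverter interpret the bytes in host order.
    public static int ViaReverse(byte[] bigEndian)
    {
        byte[] copy = (byte[])bigEndian.Clone();
        if (BitConverter.IsLittleEndian) Array.Reverse(copy);
        return BitConverter.ToInt32(copy, 0);
    }

    // Read in host order, then swap from network (big-endian) order if needed.
    public static int ViaNetworkOrder(byte[] bigEndian)
    {
        return IPAddress.NetworkToHostOrder(BitConverter.ToInt32(bigEndian, 0));
    }

    static void Main()
    {
        byte[] wire = { 0x00, 0x00, 0x01, 0x02 };
        Console.WriteLine(ViaReverse(wire));      // 258
        Console.WriteLine(ViaNetworkOrder(wire)); // 258
    }
}
```

Checking `BitConverter.IsLittleEndian` before reversing is what keeps the first method correct on a big-endian host, where an unconditional reverse would give the wrong value.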

You may be looking for Encoding.BigEndianUnicode. That's the big-endian UTF-16 encoding, which isn't strictly speaking the same as UCS-2 (as pointed out by Marc) but should be fine unless you give it strings including characters outside the BMP (i.e. above U+FFFF), which can't be represented in UCS-2 but are represented in UTF-16.

From the Wikipedia page:

The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.

I find it highly unlikely that the client system is sending you characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway).
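
To make the difference concrete, here is a small sketch (the `SurrogateDemo` and `HexOf` names are just for illustration) comparing a BMP character with one above U+FFFF when encoded with Encoding.BigEndianUnicode:

```csharp
using System;
using System.Text;

static class SurrogateDemo
{
    // Encodes a single Unicode code point as big-endian UTF-16 and returns hex.
    public static string HexOf(int codePoint)
    {
        string text = char.ConvertFromUtf32(codePoint);
        return BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(text));
    }

    static void Main()
    {
        // U+20AC is inside the BMP: one 16-bit code unit, same as UCS-2.
        Console.WriteLine(HexOf(0x20AC));  // 20-AC

        // U+1F600 is outside the BMP: UTF-16 needs a surrogate pair (4 bytes),
        // which has no UCS-2 equivalent.
        Console.WriteLine(HexOf(0x1F600)); // D8-3D-DE-00
    }
}
```

As long as the data stays within the BMP, every character is exactly two bytes and the two encodings agree.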

edited Aug 07 '11 at 08:46

answered Aug 07 '11 at 08:26

Or within the surrogate ranges. – Ignacio Vazquez-Abrams Aug 07 '11 at 08:34
@Ignacio: It's not clear to me whether you posted your comment before or after my edit... can you check again and see whether there's still anything to add? – Jon Skeet Aug 07 '11 at 08:39
As far as I know, all text should be normal characters. – RyanTimmons91 Aug 07 '11 at 08:40
@Merlin33069: Then I *strongly* suspect that the problem isn't where you think it is. But it's hard to say without a concrete example of the data involved. – Jon Skeet Aug 07 '11 at 08:42
Before, but it still stands; UCS-2 will look at surrogates as unknown characters but valid codepoints, whereas UTF-16 will choke if a proper surrogate pair isn't found. – Ignacio Vazquez-Abrams Aug 07 '11 at 08:42
@Ignacio: Doesn't that come under "characters where there's a difference (which is basically the surrogate pairs, which are permanently reserved for that use anyway)"? – Jon Skeet Aug 07 '11 at 08:47
Your endian converter in MiscUtil worked; I'm hoping we can use it in the final project. – RyanTimmons91 Aug 07 '11 at 09:44

score 2 · Answer 3 · answered Aug 07 '11 at 08:44

UCS-2 is so close to UTF-16 that Encoding.BigEndianUnicode will almost always suffice.

The issue raised in the comments, reading the big-endian length prefix, is more robustly resolved via shift operations, which will do the right thing on all systems. For example:

// Read4BytesIntoBuffer stands in for whatever fills buffer with the 4 length bytes.
Read4BytesIntoBuffer(buffer);
int len = (buffer[0] << 24) | (buffer[1] << 16) | (buffer[2] << 8) | buffer[3];

This will then work the same (when parsing a big-endian 4-byte int) on any system, regardless of local endianness.
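
The same idea works for sending. A minimal sketch of both directions, assuming a 4-byte length at a known offset in a byte array (the `BigEndianLength`, `ReadInt32` and `WriteInt32` names are illustrative, not part of any library):

```csharp
using System;

static class BigEndianLength
{
    // Combine four bytes, most significant first, into an int.
    public static int ReadInt32(byte[] buffer, int offset)
    {
        return (buffer[offset] << 24) | (buffer[offset + 1] << 16)
             | (buffer[offset + 2] << 8) | buffer[offset + 3];
    }

    // Split an int into four bytes, most significant first.
    public static void WriteInt32(int value, byte[] buffer, int offset)
    {
        unchecked
        {
            buffer[offset]     = (byte)(value >> 24);
            buffer[offset + 1] = (byte)(value >> 16);
            buffer[offset + 2] = (byte)(value >> 8);
            buffer[offset + 3] = (byte)value;
        }
    }

    static void Main()
    {
        byte[] buffer = new byte[4];
        WriteInt32(258, buffer, 0);
        Console.WriteLine(BitConverter.ToString(buffer)); // 00-00-01-02
        Console.WriteLine(ReadInt32(buffer, 0));          // 258
    }
}
```

Because the shifts operate on values rather than on memory layout, neither method needs to check the host's endianness.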
