I'm trying to set up a very basic ZeroMQ-based socket link between a Python server and a C# client, using simplejson and Json.NET. I want to send a dict from Python and read it into an object in C#. Python code:

message = {'MessageType':"None", 'ContentType':"None", 'Content':"OK"}
message_blob = simplejson.dumps(message).encode(encoding = "UTF-8")
alive_socket.send(message_blob)

The message is sent as a normal UTF-8 string or, if I use UTF-16, as "'\xff\xfe{\x00"\x00..." etc.

The C# code is where my problem is:

string reply = client.Receive(Encoding.UTF8);

The UTF-8 message is received as "≻潃瑮湥≴›..." etc.

I tried to use UTF-16 and the message comes through OK, but the first symbols are still the little-endian \xFF \xFE BOM so when I try to feed it to the deserializer,

PythonMessage replyMessage = JsonConvert.DeserializeObject<PythonMessage>(reply);
//PythonMessage is just a very simple class with properties,
//not relevant to the problem

I get an error (obviously occurring at the first symbol, \xFF):

Unexpected character encountered while parsing value: .

Something is obviously wrong in the way I'm using encoding. Can you please show me the right way to do this?

Alex Bausk
  • Did you try sending a simple string, not a dict? Did that work? – raffian Oct 10 '13 at 23:06
  • It's the same with a string if I encode it like I encode the dumps() result. – Alex Bausk Oct 11 '13 at 07:10
  • What is this `Receive` method? Doesn't Socket.Receive accept only byte arrays? What is `client`? – Joni Oct 11 '13 at 07:16
  • It's the ZeroMQ client method to receive the string over sockets. Defined using ZeroMQ as using (ZmqContext context = ZmqContext.Create()) using (ZmqSocket client = context.CreateSocket(SocketType.REQ)) – Alex Bausk Oct 11 '13 at 07:17

1 Answer

Python's generic UTF-16 codec always writes a byte-order mark (BOM). If you specify UTF-16LE or UTF-16BE instead, a particular byte order is assumed and no BOM is generated. That is, use:

message_blob = simplejson.dumps(message).encode(encoding = "UTF-16le")
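
To see the difference between the two codecs, here is a minimal Python 3 sketch. The sample string `message_text` is an assumption made for illustration; it stands in for the output of `simplejson.dumps(message)`.

```python
import codecs

# Hypothetical sample payload, standing in for simplejson.dumps(message).
message_text = '{"Content": "OK"}'

# The generic "utf-16" codec prepends a BOM in the machine's native byte order.
with_bom = message_text.encode("utf-16")

# Naming the byte order explicitly produces the same code units without a BOM.
without_bom = message_text.encode("utf-16-le")

print(with_bom.startswith(codecs.BOM_UTF16))        # BOM present
print(without_bom.startswith(codecs.BOM_UTF16_LE))  # BOM absent
print(without_bom[:4])                              # first two code units: '{' and '"'
```

The receiving side then has to decode with the matching byte order, since there is no longer a BOM to detect it from.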
Joni
  • I was under the impression that FFFE is little-endian BOM? – Alex Bausk Oct 11 '13 at 07:05
  • I guess the question is, how do you get DeserializeObject to consume a UTF-16 string correctly? Should I convert it to UTF-8 (if yes, how?) or just throw away the two first bytes? The latter seems absolutely wrong to me. TIA! – Alex Bausk Oct 11 '13 at 07:07
  • You're right, I had it backwards: U+FEFF is the zero-width no-break space character (used as the BOM), and \xff\xfe is the little-endian BOM – Joni Oct 11 '13 at 08:25
  • @AlexBausk, it looks like the problem is in how the zeromq client decodes the string: it does not remove the byte-order mark. Either you have to remove it yourself, or not generate it in the first place (see edit). Then again, using UTF-8 you could avoid the whole mess with BOMs. Looking at the [source code](https://github.com/zeromq/clrzmq/blob/master/src/ZeroMQ/SendReceiveExtensions.cs) I see no reason why the ZeroMQ client should ignore the encoding you pass in, are you sure you were running the latest version of the C# code when you tested? – Joni Oct 11 '13 at 08:59
  • I found a remotely similar problem somewhere on SO and people just .Remove() the byte order mark. Yep, it works now, thanks! I am still at a loss as to why UTF-8 comes as rubbish though. Can you please recommend any good read or examples on handling/converting encodings in C#? I have almost zero experience with C#. – Alex Bausk Oct 11 '13 at 09:12
  • Regarding UTF-8, the `.encode()` part in `message_blob = simplejson.dumps(message).encode(encoding = "UTF-8")` is wrong (I initially added it because json didn't accept the string for some other reason), getting rid of it produces better results. The UTF-8 string received by C# is now readable but \0 bytes are inserted between symbols. Hope I can work it out on my own from here. – Alex Bausk Oct 11 '13 at 09:22
  • Originally you were sending UTF-8, but it still got decoded as UTF-16; if you do the same in Python you get the same results: `print '{"Content": '.encode(encoding="utf-8").decode(encoding="utf-16")` gives `≻潃瑮湥≴›`. It sounds like now you have the exact opposite: you encode in UTF-16 to send and decode as UTF-8 on receipt. – Joni Oct 11 '13 at 09:36
  • I don't have much experience in C# either but the MSDN is a good resource for most things, for example [this page on character encodings in .Net](http://msdn.microsoft.com/en-us/library/ms404377.aspx) and the documentation for the various Encoding classes in [System.Text](http://msdn.microsoft.com/en-us/library/System.Text.aspx) – Joni Oct 11 '13 at 09:39
  • Thanks! Right now I have UTF-16 -> UTF-16 transport, at least that much I can do. – Alex Bausk Oct 11 '13 at 12:01
  • For future reference: this topic contains information that sometimes software sends 3 BOMs and you have to strip them yourself. http://stackoverflow.com/questions/14181193/parse-json-c-sharp-error – Alex Bausk Oct 11 '13 at 13:09
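
The three symptoms discussed in the comments can be reproduced on the Python side alone. This is a rough Python 3 sketch, not the C# fix itself. It assumes the standard-library `json` module in place of simplejson, and the sample `payload` string is made up for illustration.

```python
import json

# Hypothetical sample payload for illustration only.
payload = '{"Content": "OK"}'

# Symptom 1: UTF-8 bytes decoded as UTF-16LE. Each pair of ASCII bytes
# is read as one 16-bit code unit, which yields unrelated characters.
garbled = '{"Content": '.encode("utf-8").decode("utf-16-le")

# Symptom 2: UTF-16LE bytes decoded as UTF-8. The high byte of each
# ASCII code unit is zero, so a NUL character follows every symbol.
padded = '{"'.encode("utf-16-le").decode("utf-8")

# Symptom 3: a decoder with a fixed byte order keeps the BOM as a
# leading U+FEFF character, which a JSON parser then rejects.
raw = b"\xff\xfe" + payload.encode("utf-16-le")
decoded = raw.decode("utf-16-le")

# Strip any leading BOM characters before parsing.
cleaned = decoded.lstrip("\ufeff")
message = json.loads(cleaned)

print(repr(garbled))
print(repr(padded))
print(message)
```

Because `lstrip` removes every leading U+FEFF, the same call also covers the case mentioned above where a sender emits several BOMs in a row. Encoding and decoding with the same codec on both ends, for example UTF-8 throughout, avoids all three symptoms.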