I am receiving some binary data from a websocket.

When I call json.loads(data) on it, a ValueError is raised.

Printing the data gives the following, which is valid JSON:

{"session":"SeFKQ0SfYZqhh6FTCcKZGw==","authenticate":1,"thread_id":1791}

But when I inspected the string further, I found that print was rendering this monstrosity as the JSON above:

'{\x00"\x00s\x00e\x00s\x00s\x00i\x00o\x00n\x00"\x00:\x00"\x00S\x00e
\x00F\x00K\x00Q\x000\x00S\x00f\x00Y\x00Z\x00q\x00h\x00h\x006\x00F
\x00T\x00C\x00c\x00K\x00Z\x00G\x00w\x00=\x00=\x00"\x00,\x00"\x00a
\x00u\x00t\x00h\x00e\x00n\x00t\x00i\x00c\x00a\x00t\x00e\x00"\x00:
\x001\x00,\x00"\x00t\x00h\x00r\x00e\x00a\x00d\x00_\x00i\x00d\x00"
\x00:\x001\x007\x009\x001\x00}\x00'

What is this data, and how can I turn it into a native dictionary with json.loads?

tipu

1 Answer

Your data appears to be UTF-16 encoded, little-endian with no BOM (byte-order mark).

I would first try decoding it with the UTF-16-LE codec:

data = data.decode('utf-16-le')

Then load the result with json.loads(data).

import json

# Raw bytes as received from the websocket: UTF-16-LE with no BOM
data = b'{\x00"\x00s\x00e\x00s\x00s\x00i\x00o\x00n\x00"\x00:\x00"\x00S\x00e\x00F\x00K\x00Q\x000\x00S\x00f\x00Y\x00Z\x00q\x00h\x00h\x006\x00F\x00T\x00C\x00c\x00K\x00Z\x00G\x00w\x00=\x00=\x00"\x00,\x00"\x00a\x00u\x00t\x00h\x00e\x00n\x00t\x00i\x00c\x00a\x00t\x00e\x00"\x00:\x001\x00,\x00"\x00t\x00h\x00r\x00e\x00a\x00d\x00_\x00i\x00d\x00"\x00:\x001\x007\x009\x001\x00}\x00'
data = data.decode('utf-16-le')  # bytes -> unicode text
print(json.loads(data))

Output:

{u'thread_id': 1791, u'session': u'SeFKQ0SfYZqhh6FTCcKZGw==', u'authenticate': 1}
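
For readers on Python 3, the same decode-then-parse step might look like the sketch below. The helper name parse_utf16le_json is hypothetical, and the sketch assumes the payload arrives as bytes holding a JSON object in UTF-16-LE without a BOM, as in the question.

```python
import json


def parse_utf16le_json(payload: bytes) -> dict:
    """Hypothetical helper: assumes BOM-less UTF-16-LE bytes holding a JSON object."""
    text = payload.decode('utf-16-le')  # two bytes per ASCII character -> str
    return json.loads(text)


# Rebuild the bytes from the question by encoding the JSON text as UTF-16-LE.
payload = (
    '{"session":"SeFKQ0SfYZqhh6FTCcKZGw==","authenticate":1,"thread_id":1791}'
).encode('utf-16-le')

message = parse_utf16le_json(payload)
print(message['thread_id'])  # prints 1791
```

On Python 3.6 and later, json.loads also accepts bytes directly and detects UTF-8, UTF-16, or UTF-32 on its own, so the explicit decode mainly helps on older versions or when you want the encoding to be stated in the code.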
Jonathon Reinhart
  • How did you determine the encoding? – tipu Mar 16 '15 at 06:15
  • Please consider adding more detail to address tipu's question. – backtrack Mar 16 '15 at 06:18
  • @tipu Experience, mostly. I noticed that every other byte, starting with the second byte in the stream, was `00`. That meant every character was encoded as two bytes, in little-endian (least significant byte first) order. I also noticed there was no BOM at the beginning. Then I consulted [this answer](http://stackoverflow.com/a/8827604/119527) to remind me which decoder was appropriate. – Jonathon Reinhart Mar 16 '15 at 06:18
  • The better question is: where did your data come from? Did the source have some way of indicating that it would be UTF-16 encoded? – Jonathon Reinhart Mar 16 '15 at 06:19
  • @JonathonReinhart I should have looked at what the JavaScript WebSocket sends when a binary protocol is specified for its messaging. – tipu Mar 16 '15 at 06:23
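
The byte-pattern reasoning in the comments (every second byte is `00`, no BOM at the start) can be turned into a rough check. This is only a heuristic sketch in Python 3: the function name guess_utf16_byte_order is hypothetical, and the zero-byte test only holds for text made entirely of ASCII characters, so treat a None result as "unknown" rather than "not UTF-16".

```python
import codecs


def guess_utf16_byte_order(payload: bytes):
    """Hypothetical heuristic: return a codec name or None if the pattern is unclear."""
    # A BOM settles the question; the generic 'utf-16' codec consumes it.
    if payload.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    # UTF-16 text always has an even number of bytes.
    if len(payload) < 2 or len(payload) % 2:
        return None
    first_bytes = payload[0::2]   # bytes at offsets 0, 2, 4, ...
    second_bytes = payload[1::2]  # bytes at offsets 1, 3, 5, ...
    # ASCII text in little-endian order: character byte first, then 00.
    if all(first_bytes) and not any(second_bytes):
        return 'utf-16-le'
    # ASCII text in big-endian order: 00 first, then the character byte.
    if all(second_bytes) and not any(first_bytes):
        return 'utf-16-be'
    return None


sample = '{"authenticate":1}'.encode('utf-16-le')
codec = guess_utf16_byte_order(sample)
print(codec)  # prints utf-16-le
print(sample.decode(codec))
```

A check like this is a fallback at best. If the sender can state the encoding, or can send UTF-8 text frames instead, that is more reliable than guessing from the bytes.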