
A socket receives a JSON-formatted string, and may receive more than one at a time, which results in a variable containing something like this:

{'a':'1','b':'44'}{'a':'1','b':'44'}

As you can see, it is multiple JSON strings in one variable. How can I decode these in Python?

I mean, is there a way in Python to decode the two JSON strings into an array, or just a way to know there might be two strings in the output?

Using new lines to split them is not a good idea as the data might actually have new lines.

Peter Mortensen
Rami Dabain
  • How'd the socket receive a string like `"{'a':'1','b':'44'}{'a':'1','b':'44'}"`? – hjpotter92 Feb 18 '14 at 13:53
  • Thought about writing a simple parser (possibly using regexes) based on [json grammar](http://www.json.org/)? – Maciej Gol Feb 18 '14 at 13:57
  • 1
    @hjpotter92, that's easy. Just write two json strings into a stream without a delimiter. – Alfe Feb 18 '14 at 13:57
  • Do you have any control over the outputted format of the socket? If you could format the out like `[{'a':'1','b':'44'},{'a':'1','b':'44'}]` it would be valid `JSON` and could be parsed by a `JSON`-parser. – kaspermoerch Feb 18 '14 at 14:00
  • Multiple threads send through the same socket, and it happens that up to 200 threads send at the same time ... maybe I'll send a delimiter like 'YT&^^%Fe54&^Rh8R%R'? That would be impossible to have in the JSON ... I guess, lol – Rami Dabain Feb 18 '14 at 14:00
  • Since JSON can contain arbitrary depths of nested brackets, this will be a problem using regexps. They cannot replace a proper JSON lexer. They could provide a decent tokenizer, of course, and you could lex that stuff yourself, but that would mean rewriting a whole JSON parser, more or less. – Alfe Feb 18 '14 at 14:01
  • If you are using threads to send the data through the same socket, consider using synchronization mechanism. Otherwise, your json data might interleave. – Maciej Gol Feb 18 '14 at 14:01
  • @KasperMoerch Yes, I have control, but it would be a pain to implement, as there are about 2000 threads that use the same socket to send the data – Rami Dabain Feb 18 '14 at 14:02
  • @Alfe, I've meant a simple parser using regex for tokenizing input. Sorry for being not specific enough. The point is that the grammar itself is simple, the regexes for tokens easy, and in return you'd get a parser that would tell you json objects' boundaries. – Maciej Gol Feb 18 '14 at 14:03
  • @kroolik You're right ... though packets are ordered, this probably won't happen (it didn't happen in the past 3 days ... I am watching for that). I might look into that, as it will let me implement Kasper's solution ... but I still want to know if there are fewer problems with that – Rami Dabain Feb 18 '14 at 14:04
  • @RonanDejhero, consider sending something like `socket.sendall(json.dumps(obj) + '%!(JSON_DELIMITER)')`. This way you just need to split on the `%!(JSON_DELIMITER)` string. – Maciej Gol Feb 18 '14 at 14:04
  • @RonanDejhero, the packets themselves might be ordered, but what about when 10 threads decide to `sendall` a really big json object - one that doesn't fit into internal socket buffer. – Maciej Gol Feb 18 '14 at 14:06
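The delimiter idea from these comments can be sketched as below. The helper names and the choice of delimiter are made up for illustration; a control character such as `"\x1e"` works because unescaped control characters can never appear inside valid JSON text. This still assumes the threads don't interleave their writes mid-record:

```python
import json

DELIMITER = "\x1e"  # hypothetical choice: the ASCII "record separator" control character

def encode_records(objs):
    """Join the JSON documents with the delimiter before writing to the socket."""
    return DELIMITER.join(json.dumps(o) for o in objs) + DELIMITER

def decode_records(data):
    """Split the received buffer on the delimiter and parse each non-empty chunk."""
    return [json.loads(chunk) for chunk in data.split(DELIMITER) if chunk]

payload = encode_records([{"a": "1", "b": "44"}, {"a": "1", "b": "44"}])
print(decode_records(payload))  # [{'a': '1', 'b': '44'}, {'a': '1', 'b': '44'}]
```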

1 Answer


You can use the standard JSON parser and make use of the descriptive exception it throws when there is extra data after a proper JSON string.

Currently (that is, in my version of the JSON parser), it throws a ValueError with a message looking like this: "Extra data: line 3 column 1 - line 3 column 6 (char 5 - 10)". (Newer Python versions shorten this to something like "Extra data: line 1 column 3 (char 2)".)

The number 5 in this case (you can parse it out of the message easily with a regular expression) tells you where the parsing failed. So if you get that exception, you can parse the substring of your original input up to that character, and afterwards (I propose recursively) parse the rest.

import json
import re

def jsonMultiParse(s):
    try:
        return json.loads(s)
    except ValueError as problem:
        # The offset of the extra data is in the "(char N" part of the message;
        # this pattern matches both the old and the current message format.
        m = re.match(r'Extra data: .*\(char (\d+)', str(problem))
        if not m:
            raise
        extraStart = int(m.group(1))
        return json.loads(s[:extraStart]), jsonMultiParse(s[extraStart:])

print(jsonMultiParse('{}[{}]    \n\n["foo", 3]'))

Will print:

({}, ([{}], ['foo', 3]))

In case you prefer to get a straight tuple instead of a nested one:

    return (json.loads(s),)

and

    return (json.loads(s[:extraStart]),) + jsonMultiParse(s[extraStart:])

This returns:

({}, [{}], ['foo', 3])
Peter Mortensen
Alfe
  • The main problem with this approach is that it depends on the format of the exception being thrown, i.e. it can change with every new version or cease to function altogether. A proper exception which provided this information in a better fashion (a field) would of course be nicer. And even nicer would be to get the first completed value _and_ the information on how many characters it used. But we have to live with what we've got. – Alfe Feb 18 '14 at 14:18
  • That is smart! I believe it should be implemented in the JSON module itself, as `json.multiparse`; that would be more suitable. I'll try it now – Rami Dabain Feb 18 '14 at 14:28
  • Well, smart would be to implement it in the JSON module, because by catching and retrying you effectively parse everything more than once. – Alfe Feb 18 '14 at 14:37
  • 1
    Oh, I just saw in Martijn's post (which is given as the answer to the potential duplicate) that you can use `JSONDecoder().raw_decode()` which returns exactly what I wanted: The first Json object and the number of bytes consumed. With this, our function can be without the ugly catching and retrying. – Alfe Feb 18 '14 at 14:40
  • Yes, but that won't work if there are newlines or any other rubbish data, which can be solved in this piece of code with some editing :) (looking for the first `{` after the "extra data" portion) – Rami Dabain Feb 18 '14 at 15:47
  • I think it's better to use `raw_decode()` and strip the whitespace manually (because that seems to be all that raises the problem). You can even pass a second argument `idx` to tell the function where to _start_ decoding, so there is no need to create substrings (which is costly). – Alfe Feb 18 '14 at 22:01
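Putting the `raw_decode()` suggestion from these comments together, a sketch (the function name is made up) that skips inter-document whitespace and parses each document in turn, without creating substrings:

```python
import json

def parse_concatenated(s):
    """Parse a string holding several concatenated JSON documents."""
    decoder = json.JSONDecoder()
    idx, results = 0, []
    while idx < len(s):
        # raw_decode() raises on leading whitespace, so skip it manually
        if s[idx].isspace():
            idx += 1
            continue
        obj, idx = decoder.raw_decode(s, idx)  # idx now points past this document
        results.append(obj)
    return results

print(parse_concatenated('{}[{}]    \n\n["foo", 3]'))  # [{}, [{}], ['foo', 3]]
```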