2

The incoming data resembles the following:

[{
    "foo": "bar"
}]
[{
    "bar": "baz"
}]
[{
    "baz": "foo"
}]

As you can see, it's arrays of objects strung together: each array is valid JSON on its own, but the concatenation as a whole is not.

ijson is able to handle the first array, and then I get:

ijson.common.JSONError: Additional data

when it hits the subsequent arrays. How do I get around this?

Mr-IDE
Carl Sagan

2 Answers

1

Here's a first cut at the problem: a regex substitution that turns the full input string into valid JSON. It only works if you're OK with reading the entire input stream before parsing.

import json
import re

input = ''
for line in inputStream:
    input = input + line
# input == '[{"foo": "bar"}][{"bar": "baz"}][{"baz": "foo"}]'

# wrap the whole thing in [] and put commas between each ][
sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % input)
# sanitizedInput == '[[{"foo": "bar"}],[{"bar": "baz"}],[{"baz": "foo"}]]'

# then parse sanitizedInput
parsed = json.loads(sanitizedInput)
print(parsed)  # => [[{'foo': 'bar'}], [{'bar': 'baz'}], [{'baz': 'foo'}]]

Note: since you've read the whole thing into a string, you can use json instead of ijson.

alexanderbird
  • I think this is a good start for the solution. I would like to add that the `inputStream` from Kafka might be coming in real time, and the variable `input` would have to wait until it has read all the values from the Kafka broker. We could use `sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % line)` for each `line`. That should resolve the issue. – KartikKannapur Dec 11 '15 at 08:08
  • does each line contain `[{"foo": "bar"}]`? Or are each of those split over three lines like in your question? – alexanderbird Dec 11 '15 at 08:09
  • because if each line is its own array, you could parse each line individually as it comes in and append it to an array. – alexanderbird Dec 11 '15 at 08:32
  • Yes I intended to suggest the same i.e. each line could be individually parsed. @CarlSagan Could you give us more details on this? – KartikKannapur Dec 11 '15 at 08:41
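If each line of the stream really is a complete `[...]` array, the per-line approach discussed in the comments can be sketched like this (a minimal sketch; `parse_lines` and the sample input are illustrative, not from the answer):

```python
import json

def parse_lines(input_stream):
    # Parse each line as soon as it arrives, instead of
    # buffering the entire stream first.
    results = []
    for line in input_stream:
        line = line.strip()
        if line:  # skip blank lines
            results.append(json.loads(line))
    return results
```

With the question's data this yields one parsed array per line, e.g. `parse_lines(['[{"foo": "bar"}]', '[{"bar": "baz"}]'])` returns `[[{'foo': 'bar'}], [{'bar': 'baz'}]]`.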
0

You can use json.JSONDecoder.raw_decode to walk through the string. Its documentation indeed says:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

The following code sample assumes all the JSON values are in one big string:

import json

def json_elements(string):
    decoder = json.JSONDecoder()  # raw_decode must be called on an instance
    while True:
        string = string.lstrip()  # raw_decode rejects leading whitespace
        if not string:
            break
        try:
            (element, position) = decoder.raw_decode(string)
            yield element
            string = string[position:]
        except ValueError:
            break
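`raw_decode` also accepts an optional start index, so the same walk can be done without repeatedly slicing the string. A self-contained sketch (the `data` sample is illustrative):

```python
import json

data = '[{"foo": "bar"}] [{"bar": "baz"}]'
decoder = json.JSONDecoder()
index = 0
results = []
while index < len(data):
    # skip whitespace between concatenated documents,
    # which raw_decode would otherwise reject
    while index < len(data) and data[index].isspace():
        index += 1
    if index == len(data):
        break
    element, index = decoder.raw_decode(data, index)
    results.append(element)
# results holds one parsed value per concatenated document
```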

To avoid dealing with raw_decode yourself and to be able to parse a stream chunk by chunk, I would recommend a library I made for this exact purpose: streamcat.

def json_elements(stream):
    decoder = json.JSONDecoder()
    yield from streamcat.stream_to_iterator(stream, decoder)

This works for any concatenation of JSON values regardless of how many white-space characters are used within them or between them.

If you have control over how your input stream is encoded, you may want to consider using line-delimited JSON, which makes parsing easier.
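For reference, line-delimited JSON simply puts one complete JSON document on each line, which turns parsing into a trivial loop (a sketch; the `records` sample is illustrative):

```python
import json

records = [{"foo": "bar"}, {"bar": "baz"}, {"baz": "foo"}]

# encode: one JSON document per line
ndjson = "\n".join(json.dumps(r) for r in records)

# decode: parse each non-empty line on its own
decoded = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
assert decoded == records
```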

bbc