2

The incoming data resembles the following:

[{
    "foo": "bar"
}]
[{
    "bar": "baz"
}]
[{
    "baz": "foo"
}]

As you can see, it's arrays of objects strung together: each array is valid JSON on its own, but the concatenation as a whole is not.

ijson is able to handle the first array, and then I get:

ijson.common.JSONError: Additional data

when it hits the subsequent arrays. How do I get around this?

Mr-IDE
Carl Sagan

2 Answers

1

Here's a first cut at the problem: a regex substitution that turns the full input string into valid JSON. It only works if you're OK with reading the entire input stream before parsing.

import json
import re

input = ''
for line in inputStream:
    input = input + line
# input == '[{"foo": "bar"}][{"bar": "baz"}][{"baz": "foo"}]'

# wrap the whole thing in [] and put commas between each ][
sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % input)
# sanitizedInput == '[[{"foo": "bar"}],[{"bar": "baz"}],[{"baz": "foo"}]]'

# then parse sanitizedInput
parsed = json.loads(sanitizedInput)
print(parsed)  # => [[{'foo': 'bar'}], [{'bar': 'baz'}], [{'baz': 'foo'}]]

Note: since you've read the whole thing into a string, you can use json instead of ijson.

alexanderbird
  • I think this is a good start for the solution. I would like to add that the `inputStream` from Kafka might be coming in real time, and the variable `input` would have to wait until it has read all the values from the Kafka broker. We could use `sanitizedInput = re.sub(r"\]\[", "],[", "[%s]" % line)` for each `line`. That should resolve the issue. – KartikKannapur Dec 11 '15 at 08:08
  • does each line contain `[{"foo": "bar"}]`? Or are each of those split over three lines like in your question? – alexanderbird Dec 11 '15 at 08:09
  • because if each line is its own array, you could parse each line individually as it comes in and append it to an array. – alexanderbird Dec 11 '15 at 08:32
  • Yes I intended to suggest the same i.e. each line could be individually parsed. @CarlSagan Could you give us more details on this? – KartikKannapur Dec 11 '15 at 08:41
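If each line of the stream really is a complete `[...]` array, the per-line approach discussed in the comments can be sketched like this (a minimal sketch; `parse_lines` and the sample input are illustrative, not from the answer):

```python
import json

def parse_lines(input_stream):
    # Parse each line as soon as it arrives, instead of
    # buffering the entire stream first.
    results = []
    for line in input_stream:
        line = line.strip()
        if line:  # skip blank lines
            results.append(json.loads(line))
    return results
```

With the question's data this yields one parsed array per line, e.g. `parse_lines(['[{"foo": "bar"}]', '[{"bar": "baz"}]'])` returns `[[{'foo': 'bar'}], [{'bar': 'baz'}]]`.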
0

You can use json.JSONDecoder.raw_decode to walk through the string. Its documentation indeed says:

This can be used to decode a JSON document from a string that may have extraneous data at the end.

The following code sample assumes all the JSON values are in one big string:

import json

def json_elements(string):
    decoder = json.JSONDecoder()  # raw_decode must be called on an instance
    while True:
        string = string.lstrip()  # raw_decode rejects leading whitespace
        if not string:
            break
        try:
            (element, position) = decoder.raw_decode(string)
            yield element
            string = string[position:]
        except ValueError:
            break
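`raw_decode` also accepts an optional start index, so the same walk can be done without repeatedly slicing the string. A self-contained sketch (the `data` sample is illustrative):

```python
import json

data = '[{"foo": "bar"}] [{"bar": "baz"}]'
decoder = json.JSONDecoder()
index = 0
results = []
while index < len(data):
    # skip whitespace between concatenated documents,
    # which raw_decode would otherwise reject
    while index < len(data) and data[index].isspace():
        index += 1
    if index == len(data):
        break
    element, index = decoder.raw_decode(data, index)
    results.append(element)
# results holds one parsed value per concatenated document
```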

To avoid dealing with raw_decode yourself and to be able to parse a stream chunk by chunk, I would recommend a library I made for this exact purpose: streamcat.

def json_elements(stream):
    decoder = json.JSONDecoder()
    yield from streamcat.stream_to_iterator(stream, decoder)

This works for any concatenation of JSON values regardless of how many white-space characters are used within them or between them.

If you have control over how your input stream is encoded, you may want to consider using line-delimited JSON, which makes parsing easier.
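For reference, line-delimited JSON simply puts one complete JSON document on each line, which turns parsing into a trivial loop (a sketch; the `records` sample is illustrative):

```python
import json

records = [{"foo": "bar"}, {"bar": "baz"}, {"baz": "foo"}]

# encode: one JSON document per line
ndjson = "\n".join(json.dumps(r) for r in records)

# decode: parse each non-empty line on its own
decoded = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
assert decoded == records
```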

bbc