Parse multiple json objects that are in one line

Question

I'm parsing files that contain JSON objects. The problem is that some files have multiple objects on one line, e.g.:

{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}

I've made a function that tries parsing a substring once there are no open brackets left, but values may contain curly brackets. I've tried skipping values by tracking where quotes start and end, but some values also contain escaped quotes. Any ideas on how to deal with this?

My attempt:

def get_lines(data):
    lines = []
    open_brackets = 0
    start = 0
    is_comment = False
    for index, c in enumerate(data):
        if c == '"':
            is_comment = not is_comment
        elif not is_comment:
            if c == '{':
                if not open_brackets:
                    start = index
                open_brackets += 1

            if c == '}':
                open_brackets -= 1
                if not open_brackets:
                    lines.append(data[start: index+1])

    return lines
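
For comparison, here is a minimal sketch of the same bracket-counting scan that also tracks backslash escapes inside strings, which is the case the attempt above trips over. The function name `split_top_level_objects` and the `sample` string are made up for illustration, and the sketch assumes the input is well-formed and that every top-level value is a `{...}` object.

```python
import json


def split_top_level_objects(data):
    """Return the substrings of `data` that each span one top-level {...}."""
    chunks = []
    depth = 0          # number of currently open curly brackets
    start = 0          # index where the current top-level object began
    in_string = False  # True while inside a double-quoted JSON string
    escaped = False    # True if the previous string character was a backslash
    for index, c in enumerate(data):
        if in_string:
            if escaped:
                escaped = False      # this character is escaped; ignore it
            elif c == '\\':
                escaped = True       # the next character is escaped
            elif c == '"':
                in_string = False    # an unescaped quote closes the string
        elif c == '"':
            in_string = True
        elif c == '{':
            if depth == 0:
                start = index
            depth += 1
        elif c == '}':
            depth -= 1
            if depth == 0:
                chunks.append(data[start:index + 1])
    return chunks


sample = r'{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}'
parsed = [json.loads(chunk) for chunk in split_top_level_objects(sample)]
print(parsed)
```

Brackets and quotes that appear inside a string are ignored because the scan only counts brackets while `in_string` is false, and an escaped quote never toggles that flag.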

What about using `json.loads` ? https://docs.python.org/2/library/json.html — JoseKilo, May 01 '16 at 13:32
For that to work, there needs to be one object parsed at a time. The same is with ujson. — , May 01 '16 at 13:36
Possible duplicate of [Parsing values from a JSON file in Python](http://stackoverflow.com/questions/2835559/parsing-values-from-a-json-file-in-python) — snakecharmerb, May 01 '16 at 13:42
This is interesting. Curious for a robust solution. Because, you know, you *could* have a string "}{" as a field in there, for example. — timgeb, May 01 '16 at 13:45
Brute force approach: Read the line character by character, try `json.loads` with the accumulated string after each character. If it succeeds, start over with accumulating characters. — timgeb, May 01 '16 at 13:46
@snakecharmerb no, my json is in a valid format, it just can't be parsed because the parsers takes only one object at a time. — , May 01 '16 at 13:49
@timgeb I like your approach but I think it could be improved by finding `}{` and try `json.loads` till that index and then iterate like you mentioned. — AKS, May 01 '16 at 13:50
I am trying that too but I think I am getting problems because of parentheses in `"bla{bl\"a"` — AKS, May 01 '16 at 13:53
The problem lies with `\"` because as soon as you enter it into another string then it is not escaped anymore because it is not `\\"` :) — AKS, May 01 '16 at 14:15
A sequence of JSON objects is not itself a valid JSON value. — chepner, May 01 '16 at 14:29
I strongly recommend wrapping all the JSON objects in an array as items: `[{...},{...},{...}]` and it will be valid JSON and wouldn't need special treatment (which is bound to break sometime) — casraf, May 01 '16 at 14:49
@casraf yes, and how to do so programmatically in a reliable way? That's kinda the question. — timgeb, May 01 '16 at 21:11

score 15 · Answer 1 · answered May 05 '17 at 14:15

You can use `raw_decode`, a method of `json.JSONDecoder`! It can read a JSON string that has extra data after the first JSON object. An example of usage would be:

>>> import json
>>> dec = json.JSONDecoder()
>>> json_str = '{"data": "Foo"}{"data": "BarBaz"}{"data": "Qux"}'
>>> dec.raw_decode(json_str)
({u'data': u'Foo'}, 15)
>>> dec.raw_decode(json_str[15:])
({u'data': u'BarBaz'}, 18)
>>> dec.raw_decode(json_str[33:])
({u'data': u'Qux'}, 15)

The first part of the tuple is the json object, the second is how much of the string was used when reading it. Therefore a loop like this will allow you to iterate over all the json objects in a string.

import json

dec = json.JSONDecoder()
pos = 0
while pos != len(json_str):
    # raw_decode returns the parsed object and the number of characters it consumed
    j, json_len = dec.raw_decode(json_str[pos:])
    pos += json_len
    # Do something with the json j here

+1, but I should note the loop in the last paragraph is too brittle. It would fail if you have trailing whitespace in the string or some trailing character *(for example because not all data was accepted yet from the network)* — Hi-Angel, Jun 01 '20 at 13:23
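
One way to address the brittleness raised in that comment is sketched below, under some assumptions: the helper name `decode_available` is hypothetical, the top-level values are assumed to be objects or arrays, and the sketch cannot tell invalid data apart from data that is merely incomplete. It passes the start index as the second argument of `raw_decode` instead of slicing, skips whitespace between values, and returns whatever could not be parsed yet so the caller can prepend it to the next chunk.

```python
import json

_decoder = json.JSONDecoder()


def decode_available(buffer):
    """Decode as many complete JSON values as `buffer` currently holds.

    Returns (values, remainder), where `remainder` is the unparsed tail.
    """
    values = []
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(buffer) and buffer[pos] in ' \t\r\n':
            pos += 1
        if pos == len(buffer):
            break
        try:
            # With an index argument, the returned position is absolute.
            value, pos = _decoder.raw_decode(buffer, pos)
        except ValueError:
            # Incomplete (or invalid) data: leave it for the caller.
            break
        values.append(value)
    return values, buffer[pos:]


values, rest = decode_available('{"data": "Foo"} {"data": "BarBaz"}  {"da')
print(values)  # [{'data': 'Foo'}, {'data': 'BarBaz'}]
print(rest)    # {"da
```

Catching `ValueError` also covers `json.JSONDecodeError`, which is a subclass of it in Python 3.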

score 4 · Answer 2 · timgeb · 2016-05-01T14:26:16.433

The problem is that you can't reasonably split by any character or sequence of characters, because that sequence could always show up in a string as a field value, for example '{"data1": "}{"}{"data2":"foo"}'.

If we assume that every substring in your file/string that is valid JSON must start with '{' and end with '}' (of course, in the general case we'd also have to deal with '[' and ']' characters), here's a brute force approach:

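import json

with open('input.txt') as inp:
    s = inp.read().strip()

jsons = []

start, end = s.find('{'), s.find('}')
# Stop when no '{' or '}' is left, so malformed input cannot loop forever
while start != -1 and end != -1:
    try:
        # Try the candidate substring from the first '{' to the current '}'
        jsons.append(json.loads(s[start:end + 1]))
    except ValueError:
        # Not a complete object yet: extend the candidate to the next '}'
        end = s.find('}', end + 1)
    else:
        # Parsed one object: drop it and start over on the remainder
        s = s[end + 1:]
        start, end = s.find('{'), s.find('}')

for x in jsons:
    print(x)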

Demo:

$ cat input.txt 
{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}
$ python json_linereader.py 
{u'data1': {u'data1_inside': u'bla{bl"a'}}
{u'data1': {u'data1_inside': u'blabla['}}
{u'data1': {u'data1_inside': u'bla{bla'}}
{u'data1': {u'data1_inside': u'bla['}}

Output for s = '{"data1": "}{"}{"data2":"foo"}'

{'data1': '}{'}
{'data2': 'foo'}

I haven't checked this code for all eventualities with unit tests, but the idea should be clear.

edited May 01 '16 at 14:26

answered May 01 '16 at 14:16

timgeb


I again suggest to search for `}{` to reduce the number of iterations. – AKS May 01 '16 at 14:26
@AKS But... I'm doing this? For OP's string, the code tries `json.loads` only 8 times. – timgeb May 01 '16 at 14:28
I didn't understand what you mean. You are searching for `{` and `}` separately. What I am saying is to directly search for `"}{"` sequence of this. – AKS May 01 '16 at 14:30
@AKS I don't understand what you mean either. Every closing `'}'` could end the current JSON substring. You have to try them all without writing an elaborate parser. You don't have to search for `'}{'`. There could be spaces in between the `'}'` and `'{'` and what about the last substring? – timgeb May 01 '16 at 14:33
Oh. now I see. The last substring could be handled. But I do agree with you on the spaces in-between. – AKS May 01 '16 at 14:35
@AKS of course, you could use regex to check for `'}\s*{'` or `'}\s*$'`, but I don't think that's needed here :) – timgeb May 01 '16 at 14:36
I've chosen Francesco's answer because it's faster. But thank you guys for your help! – May 02 '16 at 22:48

Francesco · Accepted Answer · 2016-05-01T14:45:25.000

Simple but less robust version:

>>> import re
>>> s = r'{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}'
>>> r = re.split(r'(\{.*?\})(?= *\{)', s)
>>> r
['', '{"data1": {"data1_inside": "bla{bl\\"a"}}', '', '{"data1": {"data1_inside": "blabla["}}', '', '{"data1": {"data1_inside": "bla{bla"}}', '{"data1": {"data1_inside": "bla["}}']

This will fail if `}{` is contained in a string.

As others suggested, you could then try to parse each element. If an element is not valid JSON on its own, join it with the next one and try again.

Note that r is the result of the code above

import json

accumulator = ''
res = []
for subs in r:
    accumulator += subs
    try:
        res.append(json.loads(accumulator))
        accumulator = ''
    except ValueError:
        # Not a complete JSON document yet; keep accumulating
        pass
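
To illustrate how the accumulation step compensates for a `}{` inside a string, here is a small self-contained sketch that combines the split and the retry loop. The function name `split_and_parse` and the constant `BOUNDARY` are made up for this example. It assumes objects are separated by nothing or by spaces only, since the lookahead in the pattern does not allow newlines.

```python
import json
import re

# Same pattern as above, written as a raw string.
BOUNDARY = re.compile(r'(\{.*?\})(?= *\{)')


def split_and_parse(s):
    """Split on candidate object boundaries, re-joining pieces until they parse."""
    parsed = []
    accumulator = ''
    for piece in BOUNDARY.split(s):
        accumulator += piece
        try:
            parsed.append(json.loads(accumulator))
        except ValueError:
            continue  # not a complete object yet; keep accumulating
        accumulator = ''
    return parsed


print(split_and_parse('{"data1": "}{"}{"data2":"foo"}'))
# [{'data1': '}{'}, {'data2': 'foo'}]
```

The regex wrongly splits the first object at the `}{` inside its string value, but the two fragments only parse once they have been joined again, so the result is still correct.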

score 1 · Answer 4 · answered Feb 07 '22 at 12:52

Building on Carcophan's answer and Hi-Angel's comment, I added a check to see whether the leftover part of the string is whitespace. Python strings have an `isspace()` method, so I used that to check.

NOTE: This is Python 3 -- I think the json library raised a different exception in earlier versions.

import json

dec = json.JSONDecoder()
pos = 0
results = []
while pos != len(json_str):
    try:
        j, json_len = dec.raw_decode(json_str[pos:])
    except json.JSONDecodeError:
        # Trailing whitespace after at least one object is fine; anything else is an error
        if results and json_str[pos:].isspace():
            break
        raise
    else:
        pos += json_len
        results.append(j)

Although you could just `json_str.strip()` first and avoid the checking for `isspace()` in the except block (and not have that try/except at all). — Joao Coelho, Mar 31 '22 at 00:00

score 0 · Answer 5 · answered Mar 31 '22 at 03:12

Slight improvement over @Carcophan's answer, to handle spaces before, after and between JSON objects:

import json

decoder = json.JSONDecoder()
pos = 0
objs = []
while pos < len(json_str):
    json_str = json_str[pos:].strip()
    if not json_str:
        break  # Blank line case
    obj, pos = decoder.raw_decode(json_str)
    objs.append(obj)

One thing to remember is that JSON values can be arrays, strings, numbers, etc., meaning they don't necessarily start and end with `{` and `}`. The answers here that rely on `{}` to find the start and end of each value are therefore limited to JSON objects (Python dictionaries).

This piece of code handles any type of JSON value, as well as whitespace (spaces, tabs, newlines) around it.
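
The point about non-object values can be checked with a short sketch. The generator name `iter_json_values` and the `mixed` sample string are assumptions made for this illustration. Unlike the loop above, it leaves the input string untouched and passes the current index to `raw_decode` rather than re-slicing; `str.isspace()` is slightly more permissive than JSON's own whitespace definition.

```python
import json


def iter_json_values(text):
    """Yield each JSON value found in `text`, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so step over it here.
        if text[pos].isspace():
            pos += 1
            continue
        # Raises json.JSONDecodeError (a ValueError) on invalid input.
        value, pos = decoder.raw_decode(text, pos)
        yield value


mixed = ' [1, 2] "text" 42\n{"a": null}\t'
print(list(iter_json_values(mixed)))
# [[1, 2], 'text', 42, {'a': None}]
```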
