11

I'm parsing files that contain JSON objects. The problem is that some files have multiple objects on one line, e.g.:

{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}

I wrote a function that tries to parse a substring once there are no open brackets left, but values may themselves contain curly brackets. I tried skipping string values by tracking opening and closing quotes, but some values also contain escaped quotes. Any ideas on how to deal with this?

My attempt:

def get_lines(data):
    lines = []
    open_brackets = 0
    start = 0
    is_comment = False
    for index, c in enumerate(data):
        if c == '"':
            is_comment = not is_comment
        elif not is_comment:
            if c == '{':
                if not open_brackets:
                    start = index
                open_brackets += 1

            if c == '}':
                open_brackets -= 1
                if not open_brackets:
                    lines.append(data[start: index+1])

    return lines
  • What about using `json.loads` ? https://docs.python.org/2/library/json.html – JoseKilo May 01 '16 at 13:32
  • 1
    For that to work, there needs to be one object parsed at a time. The same is with ujson. –  May 01 '16 at 13:36
  • Possible duplicate of [Parsing values from a JSON file in Python](http://stackoverflow.com/questions/2835559/parsing-values-from-a-json-file-in-python) – snakecharmerb May 01 '16 at 13:42
  • This is interesting. Curious for a robust solution. Because, you know, you *could* have a string "}{" as a field in there, for example. – timgeb May 01 '16 at 13:45
  • 1
    Brute force approach: Read the line character by character, try `json.loads` with the accumulated string after each character. If it succeeds, start over with accumulating characters. – timgeb May 01 '16 at 13:46
  • @snakecharmerb no, my JSON is in a valid format, it just can't be parsed because the parser takes only one object at a time. –  May 01 '16 at 13:49
  • @timgeb I like your approach but I think it could be improved by finding `}{` and try `json.loads` till that index and then iterate like you mentioned. – AKS May 01 '16 at 13:50
  • @AKS yeah, currently trying to code that :) – timgeb May 01 '16 at 13:52
  • I am trying that too but I think I am getting problems because of parentheses in `"bla{bl\"a"` – AKS May 01 '16 at 13:53
  • @AKS yeah, is that even valid json? – timgeb May 01 '16 at 14:13
  • The problem lies with `\"` because as soon as you enter it into another string then it is not escaped anymore because it is not `\\"` :) – AKS May 01 '16 at 14:15
  • That's the problem! How to get rid of an escaped quote :) –  May 01 '16 at 14:17
  • A sequence of JSON objects is itself not a valid JSON value. – chepner May 01 '16 at 14:29
  • @chepner Yes, that's kinda the point. – timgeb May 01 '16 at 14:37
  • 1
    I strongly recommend wrapping all the JSON objects in an array as items: `[{...},{...},{...}]` and it will be valid JSON and wouldn't need special treatment (which is bound to break sometime) – casraf May 01 '16 at 14:49
  • @casraf yes, and how to do so programmatically in a reliable way? That's kinda the question. – timgeb May 01 '16 at 21:11

5 Answers

15

You can use the JSON decoder's raw_decode method (json.JSONDecoder.raw_decode). It parses one JSON value from the start of a string and tolerates extra data after it. An example of usage would be:

>>> import json
>>> dec = json.JSONDecoder()
>>> json_str = '{"data": "Foo"}{"data": "BarBaz"}{"data": "Qux"}'
>>> dec.raw_decode(json_str)
({u'data': u'Foo'}, 15)
>>> dec.raw_decode(json_str[15:])
({u'data': u'BarBaz'}, 18)
>>> dec.raw_decode(json_str[33:])
({u'data': u'Qux'}, 15)

The first element of the tuple is the parsed JSON object; the second is the index at which parsing stopped, i.e. how many characters of the string were consumed. A loop like this will therefore let you iterate over all the JSON objects in a string:

import json

dec = json.JSONDecoder()
pos = 0
while pos != len(json_str):
    # raw_decode returns the parsed object and the number of characters consumed
    j, json_len = dec.raw_decode(json_str[pos:])
    pos += json_len
    # Do something with the json j here
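
A possible variation on the loop above, offered only as a minimal sketch: wrap it in a generator and skip whitespace by hand, since raw_decode does not skip leading whitespace itself. The name iter_json_values is hypothetical. The sketch also assumes that raw_decode accepts a start index as its second argument (it does in CPython's implementation, although the documented signature only shows the string); if you would rather not rely on that, slicing as in the loop above works too.

```python
import json


def iter_json_values(text):
    """Yield each JSON value found back to back in `text` (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Assumption: with an explicit start index, the returned position
        # is an absolute index into `text`, not a length.
        value, pos = decoder.raw_decode(text, pos)
        yield value
```

With this, list(iter_json_values(json_str)) collects all objects even when they are separated or followed by spaces and newlines. Truncated trailing data still raises an exception, which the caller has to handle.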
Carcophan
  • 1
    ⁺¹, but I should note the loop in the last paragraph is too brittle. It would fail if you have trailing whitespace in the string or some trailing characters *(for example because not all data has arrived from the network yet)* – Hi-Angel Jun 01 '20 at 13:23
4

The problem is that you can't reasonably split by any character or sequence of characters, because that sequence could always show up in a string as a field value, for example '{"data1": "}{"}{"data2":"foo"}'.
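
To make that concrete, here is a small sketch (the helper name is_valid_json is made up for illustration) that treats every }{ as an object boundary on the example string above and then checks which pieces still parse:

```python
import json

s = '{"data1": "}{"}{"data2":"foo"}'

# Naive idea: treat every "}{" as a boundary between two objects.
pieces = s.replace('}{', '}\n{').split('\n')


def is_valid_json(piece):
    """Return True if `piece` parses as JSON (hypothetical helper)."""
    try:
        json.loads(piece)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


valid = [is_valid_json(p) for p in pieces]
print(pieces)
print(valid)
```

The string is cut into three pieces instead of two, and the first two pieces are no longer valid JSON because the cut landed inside the string value.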

If we assume that every substring in your file/string that is valid JSON must start with '{' and end with '}' (of course, in the general case we'd also have to deal with '[' and ']' characters), here's a brute force approach:

import json

with open('input.txt') as inp:
    s = inp.read().strip()

jsons = []

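start, end = s.find('{'), s.find('}')
while True:
    try:
        jsons.append(json.loads(s[start:end + 1]))
    except ValueError:
        # Not a complete object yet: extend the candidate to the next '}'.
        next_end = s.find('}', end + 1)
        if next_end == -1:
            raise  # no closing brace left, so the input is malformed
        end = next_end
    else:
        s = s[end + 1:]
        if not s:
            break
        start, end = s.find('{'), s.find('}')

for x in jsons:
    print(x)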

Demo:

$ cat input.txt 
{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}
$ python json_linereader.py 
{u'data1': {u'data1_inside': u'bla{bl"a'}}
{u'data1': {u'data1_inside': u'blabla['}}
{u'data1': {u'data1_inside': u'bla{bla'}}
{u'data1': {u'data1_inside': u'bla['}}

Output for s = '{"data1": "}{"}{"data2":"foo"}'

{'data1': '}{'}
{'data2': 'foo'}

I haven't checked this code for all eventualities with unit tests, but the idea should be clear.

timgeb
  • I again suggest to search for `}{` to reduce the number of iterations. – AKS May 01 '16 at 14:26
  • @AKS But... I'm doing this? For OP's string, the code tries `json.loads` only 8 times. – timgeb May 01 '16 at 14:28
  • I didn't understand what you mean. You are searching for `{` and `}` separately. What I am saying is to directly search for `"}{"` sequence of this. – AKS May 01 '16 at 14:30
  • 1
    @AKS I don't understand what you mean either. Every closing `'}'` could end the current JSON substring. You have to try them all without writing an elaborate parser. You don't have to search for `'}{'`. There could be spaces in between the `'}'` and `'{'` and what about the last substring? – timgeb May 01 '16 at 14:33
  • Oh. now I see. The last substring could be handled. But I do agree with you on the spaces in-between. – AKS May 01 '16 at 14:35
  • @AKS of course, you could use regex to check for `'}\s*{'` or `'}\s*$'`, but I don't think that's needed here :) – timgeb May 01 '16 at 14:36
  • I've chosen Francesco's answer because it's faster. But thank you guys for your help! –  May 02 '16 at 22:48
3

Simple but less robust version:

>>> import re
>>> s = r'{"data1": {"data1_inside": "bla{bl\"a"}}{"data1": {"data1_inside": "blabla["}}{"data1": {"data1_inside": "bla{bla"}}{"data1": {"data1_inside": "bla["}}'
>>> r = re.split(r'(\{.*?\})(?= *\{)', s)
>>> r
['', '{"data1": {"data1_inside": "bla{bl\\"a"}}', '', '{"data1": {"data1_inside": "blabla["}}', '', '{"data1": {"data1_inside": "bla{bla"}}', '{"data1": {"data1_inside": "bla["}}']

On its own, this split will fail if }{ is contained in a string.

As others suggested, you could then try to parse each element. If an element is not valid, check it again joined with the next one.

Note that r is the result of the code above:

import json

accumulator = ''
res = []
for subs in r:
    accumulator += subs
    try:
        res.append(json.loads(accumulator))
        accumulator = ''
    except ValueError:
        # Not valid JSON yet: keep accumulating with the next piece
        pass
Francesco
1

Building on Carcophan's answer and Hi-Angel's comment, I added a check to see if the leftover part of the string is whitespace. Python strings have an isspace() method, so I used that to check.

NOTE: This is Python 3. json.JSONDecodeError was added in Python 3.5; earlier versions raise a plain ValueError instead.

import json

dec = json.JSONDecoder()
pos = 0
results = []
while pos != len(json_str):
    try:
        j, json_len = dec.raw_decode(json_str[pos:])
    except json.JSONDecodeError:
        # Only trailing whitespace left after at least one object: we are done
        if results and json_str[pos:].isspace():
            break
        raise
    else:
        pos += json_len
        results.append(j)
Kevin Joyce
  • Although you could just `json_str.strip()` first and avoid the checking for `isspace()` in the except block (and not have that try/except at all). – Joao Coelho Mar 31 '22 at 00:00
0

Slight improvement over @Carcophan's answer, to handle spaces before, after and between JSON objects:

import json

decoder = json.JSONDecoder()
pos = 0
objs = []
while pos < len(json_str):
    json_str = json_str[pos:].strip()
    if not json_str:
        break  # Blank line case
    obj, pos = decoder.raw_decode(json_str)
    objs.append(obj)

One thing to remember is that JSON values can also be arrays, strings, numbers, etc., meaning they don't necessarily start and end with {}. The answers here that rely on {} marking the start and end of each value are therefore unnecessarily limited to JSON objects (dictionaries).

This piece of code deals with any type of JSON value and with whitespace (spaces, tabs, newlines).
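
As a quick illustration, here is a hedged sketch that runs the same strip-and-decode idea over a made-up input mixing an array, a string containing }{, a number, null and an object. The input string and the variable names are assumptions chosen for the example.

```python
import json

decoder = json.JSONDecoder()
# Hypothetical input: an array, a string containing "}{", a number, null, an object.
remaining = ' [1, 2] "a}{b"\n42 null\t{"k": true} '
values = []
while True:
    remaining = remaining.strip()
    if not remaining:
        break
    # raw_decode parses one value from the start and reports where it stopped.
    value, end = decoder.raw_decode(remaining)
    values.append(value)
    remaining = remaining[end:]

print(values)
```

Note that the values still need some separator where the grammar would otherwise be ambiguous, for example whitespace between two adjacent numbers.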

Joao Coelho