AWS SageMaker output: how to read a file with multiple JSON objects spread out over multiple lines

Question

I have a bunch of json files that look like this

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142, ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774, ], "word": "blah blah blah"}

Which I can read in with

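import json

data = []
with open(file_name) as f:
    for line in f:
        # json.loads parses each line into a dict (json.dumps would re-encode the raw string)
        data.append(json.loads(line))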

But I have another file with output like this

{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}

I.e. the JSON is formatted over several lines, so I can't simply read it in line by line. Is there an easy way to parse this? Or do I have to write something that stitches together each JSON object line by line and then calls json.loads?

rv.kvetch · Accepted Answer · 2021-10-31T04:49:02.577

Hmm, as far as I know there's unfortunately no way to load JSONL-format data with a single json.loads call, since json.loads expects exactly one JSON value per string. One option, though, is to write a helper function that converts the input into a valid JSON string (a single array of objects), as below:

import json

string = """
{
    "predictions": [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]
    ]
}
{
    "predictions": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]
    ]
}
"""


def json_lines_to_json(s: str) -> str:
    # replace the first occurrence of '{'
    s = s.replace('{', '[{', 1)

    # replace the last occurrence of '}'
    s = s.rsplit('}', 1)[0] + '}]'

    # now go in and replace all occurrences of '}' immediately followed
    # by newline with a '},'
    s = s.replace('}\n', '},\n')

    return s


print(json.loads(json_lines_to_json(string)))

Prints:

[{'predictions': [[0.875780046, 0.124219939], [0.892282844, 0.107717164], [0.887681246, 0.112318777]]}, {'predictions': [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]}, {'predictions': [[0.391415, 0.608585], [0.992118478, 0.00788147748], [0.0, 1.0]]}]
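
As a sketch of an alternative that avoids editing the string at all: the standard library's json.JSONDecoder.raw_decode parses one JSON value starting at a given offset and returns the position where it stopped, so it can be called in a loop over back-to-back objects. The helper name iter_json_objects is hypothetical, and the sample is a shortened version of the data above.

```python
import json


def iter_json_objects(text: str):
    """Yield each top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it manually.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


sample = """
{
    "predictions": [[0.0, 1.0], [0.0, 1.0]
    ]
}
{
    "predictions": [[0.391415, 0.608585]
    ]
}
"""

objects = list(iter_json_objects(sample))
print(objects)
```

Because this relies on the JSON parser to find object boundaries, it should not depend on where the newlines fall. For a file, the same function could be applied to the result of f.read(), assuming the file fits in memory.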

Note: your first example actually doesn't seem like valid JSON (or at least JSON lines from my understanding). In particular, this part appears to be invalid due to a trailing comma after the last array element:

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664, ], ...}

To ensure it's valid after calling the helper function, you'd also need to remove the trailing commas, so each line is in the below format:

{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], ...},
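
One possible way to remove those trailing commas is a regular expression applied before parsing. This is only a sketch under assumptions: strip_trailing_commas is a hypothetical name, and the pattern would also alter any string value that happens to contain a comma followed by a closing bracket or brace, so it only suits data like the numeric vectors shown here.

```python
import json
import re

# A comma followed only by optional whitespace and then a closing ] or }.
TRAILING_COMMA = re.compile(r",\s*([\]}])")


def strip_trailing_commas(line: str) -> str:
    """Drop commas that sit directly before a closing bracket or brace."""
    return TRAILING_COMMA.sub(r"\1", line)


raw = (
    '{"vector": [0.017906909808516502, 0.052080217748880386, '
    '-0.1460590809583664, ], "word": "blah blah blah"}'
)
cleaned = strip_trailing_commas(raw)
record = json.loads(cleaned)
print(record["word"], len(record["vector"]))
```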

There also appears to be a similar question where they suggest splitting on newlines and calling json.loads on each line. That approach should be (slightly) less performant, since it calls json.loads once per object rather than once on the whole list, as the timing comparison below shows.

from timeit import timeit
import json


string = """\
{"vector": [0.017906909808516502, 0.052080217748880386, -0.1460590809583664 ], "word": "blah blah blah"}
{"vector": [0.01027186680585146, 0.04181386157870293, -0.07363887131214142 ], "word": "blah blah blah"}
{"vector": [0.011699287220835686, 0.04741542786359787, -0.07899319380521774 ], "word": "blah blah blah"}\
"""


def json_lines_to_json(s: str) -> str:

    # Strip newlines from end, then replace all occurrences of '}' followed
    # by a newline, by a '},' followed by a newline.
    s = s.rstrip('\n').replace('}\n', '},\n')

    # return string value wrapped in brackets (list)
    return f'[{s}]'


n = 10_000

print('string replace:        ', timeit(r'json.loads(json_lines_to_json(string))', number=n, globals=globals()))
print('json.loads each line:  ', timeit(r'[json.loads(line) for line in string.split("\n")]', number=n, globals=globals()))

Result (total seconds for n = 10,000 runs):

string replace:         0.07599360000000001
json.loads each line:   0.1078384

Thank you for confirming. Wanted to see if I could do it a cleaner way, but looks like I'll have to do a helper function. Thanks for the suggestions — L Xandor, Oct 31 '21 at 01:04
@LXandor No problem, glad I could help out. I was also doing a quick Google search and came across this [other question](https://stackoverflow.com/questions/50475635/loading-jsonl-file-as-json-objects/50475669) that also appears to be asking about reading in JSONL data, but they suggest a different approach there. I also updated my post to show that approach, which is only slightly less efficient but also much easier to do. — rv.kvetch, Oct 31 '21 at 04:45
