
I'm trying to load a large file (2GB in size) filled with JSON strings, delimited by newlines. Ex:

{
    "key11": value11,
    "key12": value12,
}
{
    "key21": value21,
    "key22": value22,
}
…

The way I'm importing it now is:

content = open(file_path, "r").read() 
j_content = json.loads("[" + content.replace("}\n{", "},\n{") + "]")

This seems like a hack (adding commas between each JSON string, plus a beginning and ending square bracket, to make it a proper list).

Is there a better way to specify the JSON delimiter (newline \n instead of comma ,)?

Also, Python can't seem to properly allocate memory for an object built from 2GB of data. Is there a way to construct each JSON object as I'm reading the file line by line? Thanks!

Cat
  • just read each line and construct a json object at this time – njzk2 Feb 03 '14 at 17:42
  • @njzk2: I think the problem is that there are newlines inside the JSON objects, not just between them, right? – Arkady Feb 03 '14 at 17:48
  • there are newlines between the JSON objects, and inside of them, yes. The replace function works because the only place where a newline separates a closing and an opening curly brace ("}" and "{") is between objects. I'd still like to not rely on it to load the JSON. – Cat Feb 03 '14 at 17:50
  • @Arkady, Cat: see the end of my answer; someone wrote a parser that accounts for that sort of thing, I think that should solve your issue. – njzk2 Feb 03 '14 at 18:17

8 Answers


Just read each line and construct a JSON object as you go:

import json

with open(file_path) as f:
    for line in f:
        j_content = json.loads(line)  # one complete JSON object per line

This way, you load a proper, complete JSON object (provided there is no \n inside a JSON value or in the middle of your JSON object), and you avoid the memory issue because each object is created only when it is needed.

There is also this answer:

https://stackoverflow.com/a/7795029/671543
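
For reference, the linked answer takes an incremental approach. A rough sketch of that idea, using json.JSONDecoder.raw_decode (the helper name and buffering strategy here are my own, not the linked code), could look like this:

import json

def iter_concatenated_json(text):
    """Yield each top-level JSON object in `text`, even when objects span several lines."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        # Skip the whitespace (including newlines) between objects
        while idx < end and text[idx].isspace():
            idx += 1
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

This still needs the whole text in memory, but it does not depend on where the newlines fall.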

njzk2
  • Thanks for sharing the link, @njzk2. The code you wrote doesn't quite work though: `json.loads` raises an exception if you call it on a partial JSON string... – Cat Feb 03 '14 at 18:59
  • yes, hence my comment `provided there is no \n (...) in the middle of your json object`. Otherwise, the link I added points to an answer with a parser that works with your scenario. – njzk2 Feb 03 '14 at 19:59
  • `json.loads` fails because there are no commas between the JSON objects, irrespective of newlines being present or not... – Cat Feb 03 '14 at 20:02
  • No. `json.loads` fails because the line does not contain a complete JSON object. `for line in f` loops over the lines of your file. If a line does not contain a complete JSON object (such as when it is split over several lines), it fails. – njzk2 Feb 03 '14 at 20:04
  • Alternatively, and perhaps more concisely, `[json.loads(line) for line in f]` makes this a one-liner and easy to nest inside other expressions. – 千木郷 Sep 27 '18 at 07:15
import json

contents = open(file_path, "r").read()
data = [json.loads(item) for item in contents.strip().split('\n')]
Tjorriemorrie

This will work for the specific file format that you gave. If your format changes, then you'll need to change the way the lines are parsed.

{
    "key11": 11,
    "key12": 12
}
{
    "key21": 21,
    "key22": 22
}

Just read line-by-line, and build the JSON blocks as you go:

import json

with open(args.infile, 'r') as infile:

    # Variable for building our JSON block
    json_block = []

    for line in infile:

        # Add the line to our JSON block
        json_block.append(line)

        # Check whether we closed our JSON block
        if line.startswith('}'):

            # Do something with the JSON dictionary
            json_dict = json.loads(''.join(json_block))
            print(json_dict)

            # Start a new block
            json_block = []

If you are interested in parsing one very large JSON file without saving everything to memory, you should look at using the object_hook or object_pairs_hook callback methods in the json.load API.
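
For illustration, here is a minimal sketch of what such a hook can look like (the file name, hook, and key names are placeholders, and this applies to a single large JSON document rather than the newline-delimited case). json.load still reads the whole document, but the hook lets you slim down what you keep from each decoded object:

import json

def keep_wanted_keys(pairs):
    # Called for every JSON object as it is decoded; return a smaller
    # structure so less data is retained. The key names are placeholders.
    return {k: v for k, v in pairs if k in ("key11", "key12")}

with open("big.json") as f:  # placeholder file name
    data = json.load(f, object_pairs_hook=keep_wanted_keys)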

Dane White

This expands Cohen's answer:

import json
import pandas as pd

# s3_resource is a boto3 S3 resource created elsewhere
content_object = s3_resource.Object(BucketName, KeyFileName)
file_buffer = content_object.get()['Body'].read().decode('utf-8')

json_lines = []
for line in file_buffer.splitlines():
    j_content = json.loads(line)
    json_lines.append(j_content)

df_readback = pd.DataFrame(json_lines)

This assumes that the entire file will fit in memory. If it is too big, then this will have to be modified to read in chunks or to use Dask.
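
As a rough sketch of the chunked variant (assuming the JSON-lines data has been saved to a local file; the file name and chunk size are placeholders), pandas can iterate over the file in pieces:

import pandas as pd

# chunksize makes read_json return an iterator of smaller DataFrames
# instead of one big frame; "data.jsonl" is a placeholder path.
for chunk in pd.read_json("data.jsonl", lines=True, chunksize=100_000):
    print(len(chunk))  # replace with whatever per-chunk processing you need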

denson

I had to read some data from AWS S3 and parse a newline-delimited JSONL file. My solution was to use `splitlines`.

The code:

for line in json_input.splitlines():
    one_json = json.loads(line)
Cohen
  • `splitlines` is not safe for JSON Lines: it can split a JSON line in the middle if there are strings containing certain characters, such as `NEL` (`0x85`). – Gallaecio Nov 30 '20 at 11:28
  • Didn't know that, has worked for me for a long long time but good to know I guess. – Cohen Dec 06 '20 at 13:38

The line-by-line reading approach is good, as mentioned in some of the answers above.

However, across multiple JSON tree structures I would recommend decomposing the work into two functions for more robust error handling.

For example:

import json

def load_cases(file_name):
    with open(file_name) as file:
        cases = (parse_case_line(json.loads(line)) for line in file)
        cases = filter(None, cases)  # drop records the parser rejected
        return list(cases)

`parse_case_line` can encapsulate the key-parsing logic required in your example above, for instance with regex matching or application-specific requirements. It also means that you can select which JSON key-value pairs you want to parse out.
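
A hypothetical parse_case_line (the key names are just the ones from the question's example) might look like this:

def parse_case_line(record):
    # Return None for records we don't care about; filter(None, ...) in
    # load_cases drops them. The key names are placeholders.
    if "key11" not in record:
        return None
    return {"key11": record["key11"], "key12": record.get("key12")}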

Another advantage of this approach is that `filter` handles multiple `\n` in the middle of your JSON object, and the whole file gets parsed :-).

Pranav Kasetti

The jq python library makes quick work of handling cases like this.

Here's an example from my project; in my case I already had the file object from elsewhere.

import jq

raw = file.read().decode("utf-8")
for row in iter(jq.compile(".").input(text=raw)):
    print(row)

And the dependency can be installed as such:

pip install jq

Look into jq; it's an easy way to query JSON objects, on the command line or in code.

hackpoetic

Just read the file line by line and parse each line as you stream through it. Your hacky trick (adding commas between each JSON string, plus a beginning and an ending square bracket, to make it a proper list) isn't memory-friendly when the file is much more than 1GB, because the whole content will land in RAM.

xosg