I have a large JSON file (2.4 GB) that I want to parse in Python. The data looks like the following:

[
{
  "host": "a.com",
  "ip": "1.2.2.3",
  "port": 8
},
{
  "host": "b.com",
  "ip": "2.5.0.4",
  "port": 3

},
{
  "host": "c.com",
  "ip": "9.17.6.7",
  "port": 4
}
]

I run this Python script `parser.py` to load the data for parsing:

import json
from pprint import pprint


with open('mydata.json') as f:
    data = json.load(f)

Traceback (most recent call last):
  File "parser.py", line xx, in <module>
    data = json.load(f)
  File "/usr/lib/python3.6/json/__init__.py", line 296, in load
    return loads(fp.read(),
MemoryError

1) How can I load large files for parsing without getting this error?

2) Any alternative methods?

user9371654
  • Are you creating these massive JSON files? If so, you might want to consider using a different format. While JSON _can_ be parsed iteratively (as explained in Karl’s great answer), it’s not really designed for such uses, and often there’s something better (which may be as simple as transposing the data into something you can save as a bunch of JSON files/a zip file of JSON files/a JSONlines file, or may be as complex as using a database). – abarnert Aug 24 '18 at 16:39
  • If you’re on a 64-bit platform and you have 8+GB of RAM and running a 32-bit Python, it’s possible that switching to a 64-bit Python will give you a quick fix. Probably not a great solution even if it works, but if you just need quick&dirty process-this-one-file-this-one-time… – abarnert Aug 24 '18 at 16:42
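
As abarnert's comment suggests, one practical option is to convert the array once into a JSON Lines file and stream it afterwards. A minimal sketch, assuming the `mydata.json` filename from the question and a hypothetical output file `mydata.jsonl` (it uses the `ijson` library recommended in the answer below):

import json
import ijson

# One-time conversion: stream the large array with ijson and write one
# JSON object per line (JSON Lines). Note that ijson may yield
# decimal.Decimal for non-integer numbers, which json.dumps rejects;
# the sample data here contains only strings and ints.
with open('mydata.json', 'rb') as src, open('mydata.jsonl', 'w') as dst:
    for record in ijson.items(src, 'item'):
        dst.write(json.dumps(record) + '\n')

# From then on, each record can be parsed independently, line by line,
# with constant memory usage.
with open('mydata.jsonl') as f:
    for line in f:
        record = json.loads(line)
        # process record here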

1 Answer

The problem is that the file is too large to load into memory all at once, so you must process it in sections.
I would recommend using ijson or json-streamer, which can read the JSON file iteratively instead of trying to load the whole file into memory at once.

Here's an example of using ijson:

import ijson

entry = {}  # Keeps track of the values for the current JSON item

with open('mydata.json', 'rb') as f:  # ijson works best with binary-mode files
    for prefix, event, value in ijson.parse(f):
        # Start of an item map
        if (prefix, event) == ('item', 'start_map'):
            entry = {}  # Start of a new JSON item
        elif prefix.endswith('.host'):
            entry['host'] = value  # Add the value to the entry
        elif prefix.endswith('.ip'):
            entry['ip'] = value
        elif prefix.endswith('.port'):
            entry['port'] = value
        elif (prefix, event) == ('item', 'end_map'):
            print(entry)  # Do something with the complete entry here

`prefix` is the path to the element currently being iterated over in the JSON document. `event` marks structural events such as the start/end of maps and arrays. `value` carries the associated data: the scalar for leaf values, the key name for `map_key` events, and `None` for structural events such as `start_map`.
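
If you don't need the individual events, ijson also offers a higher-level `items()` helper that assembles each array element into a dict for you; a minimal sketch against the same `mydata.json`:

import ijson

# items() yields each element under the given prefix ('item' means each
# element of the top-level array) as a fully-built Python dict.
with open('mydata.json', 'rb') as f:
    for entry in ijson.items(f, 'item'):
        print(entry)  # e.g. {'host': 'a.com', 'ip': '1.2.2.3', 'port': 8}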

Karl
  • I ran this example and got the output `{'host': 'a.com', 'ip': '1.2.2.3', 'port': 8}`. Do you have any idea how to print the output without the labels or commas? i.e. `a.com,1.2.2.3,8`, without writing another Python parser? That's how I want it. – user9371654 Aug 24 '18 at 16:47
  • The output looks like that because it is printing the whole `entry` dictionary. You probably want to keep it as a dictionary, because those are easy to work with. But if you really do want to print the values comma-separated, then just do `print('{},{},{}'.format(entry['host'], entry['ip'], entry['port']))` (see also the csv sketch below). – Karl Aug 24 '18 at 16:51
  • Unfortunately, it ran for the first 17403 records and then stopped: `raise UnexpectedSymbol(symbol, pos) ijson.backends.python.UnexpectedSymbol: Unexpected symbol 'S' at 56649111`. Do you have any clue how to solve the issue? Or can you provide an edited answer with another library that can help me read an array of JSON objects stored in a file? The file size is around 3 GB. – user9371654 Aug 24 '18 at 17:12
  • That seems like an issue with your JSON file not being formatted correctly and containing a stray 'S' character, not an issue with the file size. – Karl Aug 24 '18 at 17:19
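
To build on the comma-separated-output exchange above: for robust CSV output (with quoting handled automatically), Python's standard `csv` module can replace the manual `format()` call; a minimal sketch, assuming the same `mydata.json` and the `ijson.items()` API:

import csv
import sys

import ijson

# Stream entries and emit one CSV row per record; csv.writer takes care
# of quoting values that happen to contain commas. Output goes to stdout
# here, but any writable file object would do.
writer = csv.writer(sys.stdout)
with open('mydata.json', 'rb') as f:
    for entry in ijson.items(f, 'item'):
        writer.writerow([entry['host'], entry['ip'], entry['port']])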