
I have a large JSON file (2.4 GB) that I want to parse in Python. The data looks like the following:

[
{
  "host": "a.com",
  "ip": "1.2.2.3",
  "port": 8
},
{
  "host": "b.com",
  "ip": "2.5.0.4",
  "port": 3
},
{
  "host": "c.com",
  "ip": "9.17.6.7",
  "port": 4
}
]

I run this Python script, parser.py, to load the data for parsing:

import json
from pprint import pprint


with open('mydata.json') as f:
    data = json.load(f)

Previously, I made this post about the same code. This time I ran it with more RAM, but I got a different error. Can you please help me identify the source of the problem?

Traceback (most recent call last):
  File "parser.py", line 6, in <module>
    data = json.load(f)
  File "/usr/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1095583 column 749 (char 56649111)

There is a similar problem in this post, but I could not use that solution because I read my JSON array from a file. How would I apply the solution in this case?
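For a file this size, one way to avoid building the whole 2.4 GB string in memory is to stream the array elements with the standard library's `json.JSONDecoder.raw_decode`, which parses one value at a time and reports where it stopped. The generator below is an illustrative sketch, not code from the question; it assumes the file is a single top-level JSON array, as in the sample above:

```python
import io
import json

def iter_json_array(fp, chunk_size=1024 * 1024):
    """Yield the elements of a top-level JSON array one at a time,
    reading the file in chunks instead of loading it all at once."""
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size)
    idx = buf.index('[') + 1          # skip the opening bracket
    while True:
        # skip whitespace and the commas between elements
        while idx < len(buf) and buf[idx] in ' \t\r\n,':
            idx += 1
        if idx < len(buf) and buf[idx] == ']':
            return                    # end of the array
        try:
            obj, idx = decoder.raw_decode(buf, idx)
            yield obj
        except ValueError:
            # the element is split across a chunk boundary (or the
            # JSON really is malformed): read more data and retry
            more = fp.read(chunk_size)
            if not more:
                raise                 # no more data -> the JSON is bad
            buf = buf[idx:] + more
            idx = 0
```

You would then iterate with `with open('mydata.json') as f: for item in iter_json_array(f): ...`, processing each object without holding the rest of the file. A ValueError that still escapes this loop points at a genuinely malformed element, which is also a cheap way to locate the bad record.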

user9371654
  • The error message says that your JSON is missing a comma on line 1095583, column 749. So you need to find out why you have malformed JSON. – PM 2Ring Aug 24 '18 at 16:48
  • I suspect the JSON is probably _not_ actually malformed. I'm experiencing a very similar (and transient) problem on a file of around 60MB: sometimes it fails, sometimes not. When it does fail, it's at a different character index within the JSON each time. So I suspect something else is going on, perhaps the Python string is not fully constructed (file is not fully loaded) before `json.load` begins parsing it? – djangodude Dec 05 '18 at 18:13
  • An update on this: on a hunch regarding buffering, I tried a slightly different technique of opening the JSON file as binary, using `open(path, mode='rb', buffering=0)`, then `read()` (as *binary*), `.decode()` to string, and finally use `json.loads()` on the converted string. I haven't had a failure yet...will continue testing. I did not post this as an answer because I'm not really sure it solves the problem yet, but if the OP has a chance to try it I would love to hear their results. – djangodude Dec 05 '18 at 22:44
  • I worked around this issue by: 1) dividing the file into smaller chunks (5 files); 2) manually adding array brackets at the beginning and end of each file; 3) making sure the last object is not followed by a comma; 4) parsing each file; 5) merging the result files with the Linux `cat` command. Finally, I switched the parser to `jq` and no longer use Python. – user9371654 Dec 06 '18 at 12:29
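djangodude's unbuffered-read workaround from the comments above can be sketched as follows. Since the original 2.4 GB mydata.json is not available, the sketch first writes a tiny stand-in file with the same shape as the sample data:

```python
import json

# Tiny stand-in for the 2.4 GB mydata.json from the question.
with open('mydata.json', 'w', encoding='utf-8') as f:
    f.write('[{"host": "a.com", "ip": "1.2.2.3", "port": 8}]')

# The workaround: open as unbuffered binary, read all the raw bytes,
# decode to a string explicitly, and only then hand the fully
# constructed string to json.loads().
with open('mydata.json', mode='rb', buffering=0) as f:
    raw = f.read()

data = json.loads(raw.decode('utf-8'))
```

This only sidesteps any buffering-related issue on the read side; if the file content itself is malformed at the reported position, `json.loads` will raise the same JSONDecodeError.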

0 Answers