
I have a JSON file that is 2 GB, and when I try to load it I get this error:

json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 1093156512 (char 1093156511)

So this means there is probably some escape sequence (or something like that) messing up the JSON, correct? The issue is that this file is huge, and just opening it in an editor is a huge pain; the editor crashes before I can see what the problem is. However, I still need to fix this somehow. I'm not sure what is causing it; it could be many things.

My data is essentially a list of objects, like so:

data = [{"key1": 123, "key2": "this is the first string to concatenate"},
        {"key1": 131, "key2": "this is the second string to concatenate"},
        {"key1": 152, "key2": "this is the third string to concatenate"}]

Except with more complicated `key2` values. If the issue were a stray `\`, would getting rid of all the `\` characters in the JSON file make it work? Then again, there is nothing to say that an odd escape character is my issue. Also, I have very little control over what my input JSON file is, so I don't think I'd be able to change it anyway.

Is there any way to fix this issue without changing the input JSON file?

[EDIT] This is the whole error trace:

File "halp.py", line 38, in data = json.load(json_file,strict=False)

File "/usr/lib/python3.6/json/init.py", line 299, in load parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

File "/usr/lib/python3.6/json/init.py", line 367, in loads return cls(**kw).decode(s)

File "/usr/lib/python3.6/json/decoder.py", line 339, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end())

File "/usr/lib/python3.6/json/decoder.py", line 355, in raw_decode obj, end = self.scan_once(s, idx) json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 1093156512 (char 1093156511)

When I seek to that offset, I get:

eers in the fridge!"}, {"city_name": "Portland", "comments": "A cute space to rest your head in Portland. We just stayed for one night, briefly met Adam who was lovely! Appreciated the beers and coffe
  • What code is being used to throw the JSONDecodeError? – JacobIRR Apr 24 '17 at 03:46
  • Try this... http://stackoverflow.com/questions/10382253/reading-rather-large-json-files-in-python#10382359 – OneCricketeer Apr 24 '17 at 03:49
  • @JacobIRR I've added the whole error trace. – ocean800 Apr 24 '17 at 03:49
  • How was the file generated? Are you sure it was output correctly from that process? – OneCricketeer Apr 24 '17 at 03:50
  • @cricket_007 The file was generated from an endpoint, and yes, I am sure the output was received correctly. – ocean800 Apr 24 '17 at 03:51
  • You could view the problematic data with something like `with open(json_file, 'rb') as f: f.seek(1093156450); data = f.read(200)`. Read a section of the file near the failing offset and see what is wrong (see the runnable sketch after these comments). – Mark Tolonen Apr 24 '17 at 03:51
  • @MarkTolonen Thanks for the tip. I tried it and edited my question with more info. Puzzling, as there doesn't seem to be an issue...? – ocean800 Apr 24 '17 at 03:57
  • There's an "unterminated string", meaning one of those quotations doesn't match something that could have come way before or after that listed position. – OneCricketeer Apr 24 '17 at 04:00
  • Yes, you may have to play around with the seek and the read to get enough information. It looks like 1093156512 points to the starting quote of `"A cute space...` (assuming you used my numbers). Read more to see if there is a terminating quote. – Mark Tolonen Apr 24 '17 at 04:03
  • Possibly related: [Is there a memory efficient and fast way to load big json files in python?](https://stackoverflow.com/q/2400643) and [Reading rather large json files in Python](https://stackoverflow.com/q/10382253). – dbc Mar 19 '20 at 16:09
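A runnable version of Mark Tolonen's seek-and-read suggestion (the file name and offsets here are placeholders; substitute your own path and the offset reported in the error message):

# 'data.json' is a placeholder path; point it at your file
with open('data.json', 'rb') as f:
    f.seek(1093156450)       # a bit before the reported char 1093156511
    chunk = f.read(200)      # grab a window of bytes around the failure
    print(chunk.decode('utf-8', errors='replace'))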

3 Answers


I discovered the good guys at Luminoso have written a library to sort out this kind of issue.

Apparently, you sometimes have to deal with text that comes out of other code, where the text has often passed through several different pieces of software, each with its own quirks, probably with Microsoft Office somewhere in the chain.

This is where ftfy comes to the rescue.

import json
from ftfy import fix_text

# Read the raw text; 'data.json' stands in for whatever source has the
# potential Unicode problem
with open('data.json', encoding='utf-8') as f:
    text = f.read()

fixed_text = fix_text(text)  # repair mojibake and similar encoding glitches
data = json.loads(fixed_text)
  • What makes you think the problem has anything to do with Unicode? That library won't repair corrupted JSON. – duskwuff Mar 22 '19 at 03:06
  • @duskwuff I am not a JSON guru, but my life would have been a little less painful if such an answer had been available this quickly through this forum. Maybe it won't solve this particular problem, but the error it solved for me is exactly the same as the one highlighted in the title, which is why I came back and pointed to the answer here. Being a negative troll does not help people; many times it's enough to leave a comment. – unlockme Mar 22 '19 at 03:15

I was having this problem with my data and tried many of the things recommended online, to no avail. Finally, I just read the JSON lines into a dictionary, line by line, skipping any lines that raised an exception. Then I loaded the dictionary into a DataFrame: no error.

In the code below, you can see that I actually read the lines into a dictionary of dictionaries (using enumerate to get a numeric key); this gives Pandas an index to use and avoids an error. I also had to transpose the DataFrame (`.T`) to get the data the way I wanted.

This is with a JSON Lines file, so the code below won't work for a regular JSON file, but I'm sure the same principle can be applied.

I ended up losing about 20 lines out of over 388K lines of data. This doesn't matter for me, because my data is a sample anyway. If you actually need every line of your data, this isn't the ideal solution. But if you don't, it seems the easiest way to deal with this problem is to just toss out the bad apples.

import pandas as pd
import json

filename = 'data.jl'  # JSON Lines file: one JSON object per line

my_dict = {}
with open(filename) as f:
    # Parse line by line, skipping any line that isn't valid JSON
    for i, line in enumerate(f):
        try:
            my_dict[i] = json.loads(line)
        except json.JSONDecodeError:
            pass

# Build a DataFrame from the dict of dicts; the numeric keys become the
# index, and transposing makes each parsed object a row
df = pd.DataFrame.from_dict(my_dict).T

This might not work if you're loading from a file, but for me the problem was the way I prepared the JSON:

I had a string containing the JSON data and split it at " = " to get rid of some JavaScript around it. That worked for most inputs, but once in a while the sequence " = " occurred inside the JSON data itself, which left me with an incomplete (unterminated) string.

My solution was to cut off the known prefix and suffix instead of splitting:

# str.lstrip strips a set of characters, not a literal prefix, so slice
# off the known "foo = " prefix instead, then drop the trailing ";"
json_str = str_with_js[len("foo = "):].rstrip(";")
json_obj = json.loads(json_str)
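For illustration, a minimal sketch of the failure mode; the `str_with_js` value here is hypothetical:

import json

# Hypothetical wrapper whose JSON payload itself contains " = "
str_with_js = 'foo = {"comments": "a = b is an equation"};'

# Splitting truncates the payload at the inner " = ":
broken = str_with_js.split(" = ")[1]   # '{"comments": "a' -> unterminated string

# Slicing off the known prefix and trailing ";" keeps the payload intact
json_str = str_with_js[len("foo = "):].rstrip(";")
print(json.loads(json_str)["comments"])   # a = b is an equation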