
I have searched everywhere on Google without finding a solution to this problem, and I keep getting the following error:

JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)

The error occurs at the line `row = json.loads(row)` in my Python file. The JSON file contains a section of the Reddit comments from 2015-05:

JSON (learn\learning_data\2015\RC_2015-05):

{
  "created_utc": "1430438400",
  "ups": 4,
  "subreddit_id": "t5_378oi",
  "link_id": "t3_34di91",
  "name": "t1_cqug90g",
  "score_hidden": false,
  "author_flair_css_class": null,
  "author_flair_text": null,
  "subreddit": "soccer_jp",
  "id": "cqug90g",
  "removal_reason": null,
  "gilded": 0,
  "downs": 0,
  "archived": false,
  "author": "rx109",
  "score": 4,
  "retrieved_on": 1432703079,
  "body": "\u304f\u305d\n\u8aad\u307f\u305f\u3044\u304c\u8cb7\u3063\u305f\u3089\u8ca0\u3051\u306a\u6c17\u304c\u3059\u308b\n\u56f3\u66f8\u9928\u306b\u51fa\u306d\u30fc\u304b\u306a",
  "distinguished": null,
  "edited": false,
  "controversiality": 0,
  "parent_id": "t3_34di91"
}

*The JSON data is only a fraction of what I actually have, and I cannot change the format, e.g.:

{
  "text": "data",
  "text": "data"
}
{
  "text2": "data",
  "text2": "data"
}
{
  "text3": "data",
  "text3": "data"
}
etc...

Python (learn\main.py):

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    for row in f:
        row_counter += 1
        row = json.loads(row)
        body = format_data(row['body'])
        created_utc = row['created_utc']
        parent_id = row['parent_id']
        comment_id = row['name']
        score = row['score']
        subreddit = row['subreddit']
        parent_data = find_parent(parent_id)

        if score >= 2:
            if acceptable(body):
                existing_comment_score = find_existing_score(parent_id)

The JSON file already has double quotes around everything that needs them. In case some other error was causing this one, I checked the JSON with JSONLint.com, which claimed it was error-free.

I based my code on this tutorial, where the tutorial's code ran without any errors (according to the attached video; when I use the code from the link above, I still get the error). Because the tutorial used Python 3.5, I downgraded my Python version, but I continued to get the same error.

What's the cause of this error if the JSON already uses double quotes and is valid according to JSONLint?

Tom Jaquo
  • You do not have to call `json.loads` for each JSON row. You need to pass a complete JSON as argument, e.g., `json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')` – floatingpurr Apr 04 '18 at 23:18
  • The JSON data I have cannot be changed, I added an example of what it currently looks like. – Tom Jaquo Apr 04 '18 at 23:37

2 Answers


Your JSON has newlines in it.

But your code is reading one row at a time and expecting it to be a complete JSON text:

for row in f:
    row_counter += 1
    row = json.loads(row)

That's not going to work.

If your file is a single JSON text, just read the whole thing:

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    row_counter += 1
    row = json.load(f)

You may want to rename row to something more meaningful, like contents.


If your file is a sequence of JSON texts, and you're generating the file yourself, the right thing to do is to change the way you generate it. A stream of arbitrary JSON texts is not really a valid format. But if you really want to build a format on top of that, you can—e.g., escape all the newlines so that you can parse it line by line. Or you can use a real format. Or you can just write out a big JSON array instead of a bunch of separate JSON texts.
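
For example, if you do end up controlling the writer, here is a minimal sketch of emitting one compact JSON text per line (the comments list is hypothetical sample data; json.dumps escapes any newlines inside string values, so each output line is a complete document):

import json

comments = [{"body": "line one\nline two", "score": 4}]  # hypothetical sample

with open("out.jsonl", "w") as out:
    for obj in comments:
        # json.dumps never emits a raw newline, so one line == one document
        out.write(json.dumps(obj) + "\n")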


If you can't change the file, you need a strategy to parse it. All of these are almost right:

  • Use `json.JSONDecoder`'s `raw_decode` method to read the next JSON text and return the decoded value plus the offset to the next one.
  • Balance brackets and braces and split every time the count drops to 0 (see the sketch after this list).
  • Scan for newlines and then backtrack to check for open brackets and braces.

Other than bad error handling, the only serious problem with any of these is that they can't possibly do the right thing for numbers as top-level texts. If your top-level texts are all objects, that's not a problem.
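
For illustration, a minimal sketch of that bracket-balancing strategy (split_json_texts is a hypothetical helper; it skips over string literals so braces inside strings don't throw off the count, and it assumes every top-level text is an object or array):

def split_json_texts(contents):
    # Split a string of concatenated JSON texts by balancing {} and [].
    texts = []
    depth = 0
    start = None
    in_string = escaped = False
    for i, ch in enumerate(contents):
        if in_string:
            # Inside a string literal, only escapes and the closing quote matter.
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in '{[':
            if depth == 0:
                start = i  # beginning of a new top-level text
            depth += 1
        elif ch in '}]':
            depth -= 1
            if depth == 0:
                texts.append(contents[start:i + 1])
    return texts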

So, using raw_decode:

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe), buffering=1000) as f:
    contents = f.read().lstrip()  # raw_decode won't skip leading whitespace
    decoder = json.JSONDecoder()
    while contents:
        # Decode one JSON text; idx is the offset just past its end.
        row, idx = decoder.raw_decode(contents)
        row_counter += 1
        # Drop the decoded text plus any whitespace before the next one.
        contents = contents[idx:].lstrip()
        # etc.

Although if your file is gigantic, you almost certainly want to mmap it and pass a slice/memoryview to raw_decode—or, if that doesn't work because you have Unicode text, you may have to buffer up chunks manually. Not exactly trivial, but then you are parsing a broken format, so this is easier than you should expect. :)
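
For completeness, a sketch of that manual chunk-buffering approach (iter_json_texts is a hypothetical helper; it assumes the file is UTF-8 text and that every top-level text is an object, so a truncated text always raises rather than decoding as a shorter value):

import json

def iter_json_texts(f, chunk_size=1 << 20):
    # Incrementally decode concatenated JSON texts without holding
    # the whole file in memory at once.
    decoder = json.JSONDecoder()
    buf = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                obj, idx = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break  # incomplete text at the end of the buffer; read more
            yield obj
            buf = buf[idx:]
    if buf.strip():
        raise ValueError("file ended in the middle of a JSON text")

with open("learning_data/2015/RC_2015-05") as f:
    for row in iter_json_texts(f):
        pass  # process each comment dict as before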

abarnert
  • "If your file is just a single JSON text, just read the whole thing:" I am going to have more JSON text, `{ "text": "data"}{"text2": "data"}`, I just didn't want to add 600,000+ lines of JSON code. The Reddit Data is (If I'm correct) a "stream of arbitrary JSON texts", which will take forever to 'correct the format of'. As of now, I cannot change the JSON data. – Tom Jaquo Apr 04 '18 at 23:28
  • @TomJaquo Then do what the rest of my answer suggests. – abarnert Apr 04 '18 at 23:41
  • Your last code appears to only cycle through the first section of JSON code I have, perhaps because I had to comment out the `del contents[:idx]` line, since I get the following error: `'str' object does not support item deletion` – Tom Jaquo Apr 04 '18 at 23:56
  • @TomJaquo Yes, pretty obviously just commenting out the code that skips to the next JSON text means you don’t skip to the next JSON text. Anyway, I was testing with a `bytearray` rather than a `str`; I’ve edited the answer to something that should work with `str`. – abarnert Apr 05 '18 at 00:01
  • @TomJaquo But as I mentioned in the answer, if your real input really is too big to fix up, it’s probably too big to read the whole thing into a str, decode it, and chop it up, and you really need to do something like mmap the file instead. (Although I doubt it really _is_ too big; 600K lines may sound like a lot but it really isn’t that much for a modern computer.) – abarnert Apr 05 '18 at 00:04
  • I'll try that as well. The updated code you provided got me past that error but invited back my other old error I kept getting: `JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. Would this one be causing the other error? (Also, why does the tutorial's code work but not mine with that method?) – Tom Jaquo Apr 05 '18 at 00:13
  • @TomJaquo Sorry, I think `raw_decode` doesn't skip over extra whitespace at the start. Does `contents[idx:].lstrip()` work? If so, I'll edit that into the answer. (Also, which tutorial are you talking about?) – abarnert Apr 05 '18 at 00:17
  • The tutorial I added in my question. The `contents[idx:].lstrip()` is stuck in a loop or something, when I print it, it keeps on going. I have a follow-up question on that, is where I put `row`, I replace with contents? eg. `parent_id = contents['parent_id'] #Not row['parent_id']?` – Tom Jaquo Apr 05 '18 at 01:16

A stream of JSON documents, one per line, is a format also known as JSONL. This is distinct from JSON proper, which permits only one document per file.
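
For illustration, the example stream from the question, abbreviated, would look like this as JSONL:

{"text": "data"}
{"text2": "data"}
{"text3": "data"}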

You can easily convert your file into this format by running jq -c . <in.json >out.json. jq is a command-line tool for processing JSON and JSONL documents; the -c flag enables "compact" mode, which puts each document on its own line of output.

Even easier, you can have that done in-line, having your Python code directly iterate over the output of jq:

import json
import subprocess

with open("learning_data/{}/RC_{}".format(timeframe.split('-')[0], timeframe)) as f:
    p = subprocess.Popen(['jq', '-c', '.'], stdin=f, stdout=subprocess.PIPE)
    for line in p.stdout:
        content = json.loads(line)
        # ...process your line's content here.
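
Alternatively, once the file has been converted a single time with jq -c, the original line-by-line loop works as written, since each line is now a complete JSON text. A minimal sketch, assuming the converted file was saved as out.json:

import json

with open("out.json") as f:  # the jq -c output
    for line in f:
        row = json.loads(line)  # each line is one complete document
        # ...process row as before.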
Charles Duffy
  • Subprocess cannot find the JSON file, not sure why. Double checked the path and renamed the `RC_2015-05` to `RC_2015-05.json` and it still could not find the file specified. – Tom Jaquo Apr 05 '18 at 00:07
  • Huh? `subprocess` shouldn't be *trying* to find the JSON file. You should already open the JSON file before you run any of the `subprocess` code. The only thing `subprocess` should try to find is the command named `jq`, and if that isn't installed, then you should install it. – Charles Duffy Apr 05 '18 at 00:17
  • I do not have `jq` currently installed, and got an error installing jq: `error: [WinError 2] The system cannot find the file specified`. – Tom Jaquo Apr 05 '18 at 00:26
  • "installing jq" how? (Not that this is in-scope for a Python-tagged question... or on StackOverflow, for that matter; "how do I install software on my operating system?" is more a [SuperUser](https://superuser.com/) topic). – Charles Duffy Apr 05 '18 at 00:33
  • Pardon me, I misread your comment thinking `jq` was a python package. I got the following error: `parse error: Unfinished string at EOF at line 603195, column 263 Error: writing output failed: Invalid argument` – Tom Jaquo Apr 05 '18 at 00:46
  • Spitballing here, but any chance line 603195 might be where your input file cuts off? If it doesn't end at a clean document boundary, then an error is to be expected. If you want to handle that exception, normal Python exception-handling should suffice. (If you wanted to suppress the error message itself, `stderr=subprocess.DEVNULL` will work on Python 3.3 or later; for earlier versions, see https://stackoverflow.com/questions/11269575/how-to-hide-output-of-subprocess-in-python-2-7). – Charles Duffy Apr 05 '18 at 00:49
  • Ah you are correct. It did not end with a `"` and `}`. And I am not getting any errors that I can see. – Tom Jaquo Apr 05 '18 at 00:57