I'm trying to parse JSON-formatted data with json.load(), but it raises an error. I tried different approaches, such as reading the file line by line and converting it to a dictionary or a list, but none of them work. I also tried the solution mentioned at loading-and-parsing-a-json, but it gives me the same error.

import json
data = []
with open('output.txt','r') as f:
    for line in f:
        data.append(json.loads(line))

Error:

ValueError: Extra data: line 1 column 71221 - line 1 column 6783824 (char 71220 - 6783823)

You can find output.txt at the link below:

Content- output.txt

samy
  • Please add the contents of output.txt – grooveplex May 10 '16 at 18:58
  • did you try `with open('output.txt','r') as f: json.loads(f.read())` ? The suggestion from your link to read line by line was because that person's JSON data was lots of structures, one per line - you can see they posted it in the question's edit history. That probably doesn't apply in your case. – TessellatingHeckler May 10 '16 at 18:59
  • Yes, same error. File size is approx 6.5 MB. – samy May 10 '16 at 19:01
  • Just to clarify, the _load()_ and _loads()_ methods are different. The former accepts a fp (file pointer) whereas the latter accepts a string. – Mark May 10 '16 at 19:03
  • Related: http://stackoverflow.com/questions/27907633/multiple-json-objects-in-one-file-extract-by-python , http://stackoverflow.com/questions/20400818/python-trying-to-deserialize-multiple-json-objects-in-a-file-with-each-object-s https://gist.github.com/sampsyo/920215 – Robᵩ May 10 '16 at 19:24
  • json.loads() decodes a whole JSON data structure, so loading it line by line will not work. Try loading the file handle directly with `json.load(open('output.txt','r'))` – ppm9 May 10 '16 at 19:53
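
To illustrate the load()/loads() distinction raised in the comments, here is a minimal sketch. The `sample` string is a made-up stand-in for the file contents, and `io.StringIO` stands in for an open file so that nothing has to exist on disk. It also suggests why switching between the two functions does not help here: both reject input in which a second document follows the first.

```python
import io
import json

# Hypothetical sample standing in for a file that holds one JSON document.
sample = '{"statuses": [{"id": 1}, {"id": 2}]}'

# json.loads() parses a string that is already in memory.
from_string = json.loads(sample)

# json.load() reads from a file-like object; io.StringIO plays the role
# of an open file here.
from_file = json.load(io.StringIO(sample))

# Both functions fail the same way when a second document follows the first.
try:
    json.loads(sample + sample)
except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
    error_message = str(err)
```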

2 Answers

Your alleged JSON file is not a properly formatted JSON file. A JSON file must contain exactly one top-level value (a list, a mapping, a number, a string, etc.). Your file appears to contain a number of JSON objects in sequence, but not in the correct format for a list.

Your program's JSON parser correctly returns an error condition when presented with this non-JSON data.

Here is a program that will interpret your file:

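import json

# Idea and some code stolen from https://gist.github.com/sampsyo/920215

data = []
with open('output.txt') as f:
    s = f.read().strip()
decoder = json.JSONDecoder()

# Decode one JSON value at a time until the input is exhausted.
while s:
    # raw_decode() returns the value and the index where it ended.
    datum, index = decoder.raw_decode(s)
    data.append(datum)
    # Drop the consumed text; raw_decode() rejects leading whitespace.
    s = s[index:].lstrip()

print(len(data))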
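
For a file of several megabytes, re-slicing the string after every object copies the remaining text each time. A possible variant, sketched here under the assumption that the whole file fits in memory, passes a start index to raw_decode() instead. The helper name `iter_json_values` and the `sample` string are made up for illustration.

```python
import json


def iter_json_values(text):
    """Yield each top-level JSON value found in text, in order."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode() does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # The second argument is the index at which decoding starts;
        # the returned index is where the decoded value ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Hypothetical input: three objects separated by nothing but whitespace.
sample = '{"statuses": [1]} {"statuses": [2]}\n{"statuses": []}'
data = list(iter_json_values(sample))
```

With a real file, `text` would be the result of `f.read()` on output.txt.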
Robᵩ
  • I'm getting this data from the Twitter API. How did you find out that the JSON file is not properly formatted? In my case, is there any workaround, as I don't have much control over the Twitter API data? – samy May 10 '16 at 19:27
  • Can you update your question with the code you use that creates the file? The best solution is to fix the bug in that code. The 2nd-best solution is the work-around I'm putting in my answer. – Robᵩ May 10 '16 at 19:32
  • My first clue that the file was improperly formatted was your error message. Python's JSON library accepts all well-formed JSON files. My 2nd clue was when I opened the file and scrolled to line 1 column 71221. I found a close-brace (`}`) immediately followed by an open brace (`{`). This is not allowed in the JSON syntax. – Robᵩ May 10 '16 at 19:36
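
The inspection described in the last comment can also be done in code rather than by scrolling. This is a sketch that assumes Python 3.5 or later, where the exception is `json.JSONDecodeError` and carries a `pos` attribute; on older versions only the message text of the ValueError is available. The `text` string is a made-up stand-in for the file contents.

```python
import json

# Hypothetical stand-in for the file contents: two objects back to back.
text = '{"statuses": [1]}{"statuses": [2]}'

try:
    json.loads(text)
except json.JSONDecodeError as err:
    # err.pos is the zero-based character offset where parsing stopped,
    # i.e. the "char" number reported in the error message.
    start = max(err.pos - 10, 0)
    context = text[start:err.pos + 10]
    # The character just before the offset and the one at the offset.
    boundary = text[err.pos - 1:err.pos + 1]
```

For this sample, `boundary` holds the close brace and the open brace that sit next to each other.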

I wrote up the following which will break up your file into one JSON string per line and then go back through it and do what you originally intended. There's certainly room for optimization here, but at least it works as you expected now.

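import json
import re

PATTERN = '{"statuses"'

# Pass 1: rewrite the file in place so that each object starts on its own line.
with open('output.txt', 'r+') as f:
    file_as_str = f.read()
    for pos in re.finditer(PATTERN, file_as_str):
        # Leave the first object alone; it already starts the file.
        if pos.start() != 0:
            f.seek(pos.start())
            # This writes three characters over the existing '{"s'.
            f.write('\n{"')

data = []

# Pass 2: parse one JSON object per line.
with open('output.txt', 'r') as f:
    for line in f:
        data.append(json.loads(line))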
Mark
  • Actually I just realized I'm overwriting the 's' in "statuses" during the overwrite in the first loop, so each key will be "tatuses" – Mark May 10 '16 at 20:19
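
One way to avoid overwriting any characters is to do the split in memory instead of seeking and writing inside the file. The sketch below assumes the objects are written back to back, with a close brace directly followed by `{"statuses"`, and that no raw newline occurs inside an object. The helper name `parse_concatenated` and the `sample` string are made up for illustration.

```python
import json


def parse_concatenated(text, marker='{"statuses"'):
    """Parse JSON objects written back to back, e.g. '{...}{...}'."""
    # Insert a newline between a closing brace and the next object's
    # opening marker. Nothing is overwritten, so the key stays "statuses".
    separated = text.replace('}' + marker, '}\n' + marker)
    # Parse each non-empty line as its own JSON document.
    return [json.loads(line) for line in separated.split('\n') if line.strip()]


# Hypothetical input: two objects with no separator between them.
sample = '{"statuses": [1]}{"statuses": [2, 3]}'
data = parse_concatenated(sample)
```

With a real file, `text` would be the result of `f.read()` on output.txt, and the file itself is left untouched.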