225

I am getting some data from a JSON file "new.json", and I want to filter some data and store it into a new JSON file. Here is my code:

import json
with open('new.json') as infile:
    data = json.load(infile)
for item in data:
    iden = item.get["id"]
    a = item.get["a"]
    b = item.get["b"]
    c = item.get["c"]
    if c == 'XYZ' or  "XYZ" in data["text"]:
        filename = 'abc.json'
    try:
        outfile = open(filename,'ab')
    except:
        outfile = open(filename,'wb')
    obj_json={}
    obj_json["ID"] = iden
    obj_json["VAL_A"] = a
    obj_json["VAL_B"] = b

And I am getting an error, the traceback is:

  File "rtfav.py", line 3, in <module>
    data = json.load(infile)
  File "/usr/lib64/python2.7/json/__init__.py", line 278, in load
    **kw)
  File "/usr/lib64/python2.7/json/__init__.py", line 326, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 369, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 88 column 2 - line 50607 column 2 (char 3077 - 1868399)

Here is a sample of the data in new.json. There are about 1500 more such dictionaries in the file:

{
    "contributors": null, 
    "truncated": false, 
    "text": "@HomeShop18 #DreamJob to professional rafter", 
    "in_reply_to_status_id": null, 
    "id": 421584490452893696, 
    "favorite_count": 0, 
    "source": "<a href=\"https://mobile.twitter.com\" rel=\"nofollow\">Mobile Web (M2)</a>", 
    "retweeted": false, 
    "coordinates": null, 
    "entities": {
        "symbols": [], 
        "user_mentions": [
            {
                "id": 183093247, 
                "indices": [
                    0, 
                    11
                ], 
                "id_str": "183093247", 
                "screen_name": "HomeShop18", 
                "name": "HomeShop18"
            }
        ], 
        "hashtags": [
            {
                "indices": [
                    12, 
                    21
                ], 
                "text": "DreamJob"
            }
        ], 
        "urls": []
    }, 
    "in_reply_to_screen_name": "HomeShop18", 
    "id_str": "421584490452893696", 
    "retweet_count": 0, 
    "in_reply_to_user_id": 183093247, 
    "favorited": false, 
    "user": {
        "follow_request_sent": null, 
        "profile_use_background_image": true, 
        "default_profile_image": false, 
        "id": 2254546045, 
        "verified": false, 
        "profile_image_url_https": "https://pbs.twimg.com/profile_images/413952088880594944/rcdr59OY_normal.jpeg", 
        "profile_sidebar_fill_color": "171106", 
        "profile_text_color": "8A7302", 
        "followers_count": 87, 
        "profile_sidebar_border_color": "BCB302", 
        "id_str": "2254546045", 
        "profile_background_color": "0F0A02", 
        "listed_count": 1, 
        "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", 
        "utc_offset": null, 
        "statuses_count": 9793, 
        "description": "Rafter. Rafting is what I do. Me aur mera Tablet.  Technocrat of Future", 
        "friends_count": 231, 
        "location": "", 
        "profile_link_color": "473623", 
        "profile_image_url": "http://pbs.twimg.com/profile_images/413952088880594944/rcdr59OY_normal.jpeg", 
        "following": null, 
        "geo_enabled": false, 
        "profile_banner_url": "https://pbs.twimg.com/profile_banners/2254546045/1388065343", 
        "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", 
        "name": "Jayy", 
        "lang": "en", 
        "profile_background_tile": false, 
        "favourites_count": 41, 
        "screen_name": "JzayyPsingh", 
        "notifications": null, 
        "url": null, 
        "created_at": "Fri Dec 20 05:46:00 +0000 2013", 
        "contributors_enabled": false, 
        "time_zone": null, 
        "protected": false, 
        "default_profile": false, 
        "is_translator": false
    }, 
    "geo": null, 
    "in_reply_to_user_id_str": "183093247", 
    "lang": "en", 
    "created_at": "Fri Jan 10 10:09:09 +0000 2014", 
    "filter_level": "medium", 
    "in_reply_to_status_id_str": null, 
    "place": null
} 
Zoe
Apoorv Ashutosh
  • 3
    This is the error you get whenever the input JSON has more than one object per line. Many of the answers here assume there is only one object per line, or construct examples obeying that, but would break if that wasn't the case. – smci Jan 03 '20 at 14:17
  • 1
    @smci : Can you explain the line `more than one object per line` – aspiring1 Feb 18 '20 at 09:11
  • 1
    @smci I think you meant "more than one line per object"? – Karl Knechtel Feb 04 '23 at 08:21
  • 1
    Yes, "more than one line per object", silly me... – smci Feb 07 '23 at 23:47

11 Answers

204

As you can see in the following example, json.loads (and json.load) cannot decode more than one JSON object from a single input.

>>> json.loads('{}')
{}
>>> json.loads('{}{}') # == json.loads(json.dumps({}) + json.dumps({}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 368, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 3 - line 1 column 5 (char 2 - 4)

If you want to dump multiple dictionaries, wrap them in a list and dump the list once, instead of dumping each dictionary separately:

>>> dict1 = {}
>>> dict2 = {}
>>> json.dumps([dict1, dict2])
'[{}, {}]'
>>> json.loads(json.dumps([dict1, dict2]))
[{}, {}]
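
Applied to files, the same idea can be sketched as follows: collect the dictionaries in a list, write the list with a single json.dump call, and read it back with a single json.load call. The file name `tweets_list.json` and the sample records are placeholders for illustration only.

```python
import json
import os
import tempfile

# Placeholder records standing in for the real dictionaries.
records = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tweets_list.json")

# One dump call writes a single JSON document: a list of objects.
with open(path, "w") as outfile:
    json.dump(records, outfile)

# One load call reads the whole list back without "Extra data".
with open(path) as infile:
    loaded = json.load(infile)
```

Calling json.dump once per dictionary on the same file would instead produce several documents back to back, which is exactly the input that triggers the error.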
falsetru
  • 10
    Can you please explain again with reference to the code I gave above? I am a newbie, and at times take long to grasp such things. – Apoorv Ashutosh Jan 11 '14 at 05:43
  • 2
    @ApoorvAshutosh, It seems like `new.json` contains one JSON document followed by additional, redundant data. `json.load` and `json.loads` can only decode a single JSON document. They raise a `ValueError` when they encounter additional data, as you can see. – falsetru Jan 11 '14 at 05:48
  • Have pasted a sample from new.json, and I am filtering out some data from it, so I don't get where I am getting extra data from – Apoorv Ashutosh Jan 11 '14 at 05:50
  • 2
    @ApoorvAshutosh, You said **1500 more such dictionaries** in the edited question. That's the additional data. If you're the one who made a `new.json`, just put a single json in a file. – falsetru Jan 11 '14 at 05:51
  • 2
    @ApoorvAshutosh, If you need to dump multiple dictionaries as json, wrap them in a list, and dump the list. – falsetru Jan 11 '14 at 05:53
  • the issue here is not about loading into a JSON file, that has already happened. Can you tell me how to retrieve data from there? I already have a file that has dictionaries in it. I now have to retrieve each of those dictionaries. http://stackoverflow.com/questions/21059466/python-json-parser – Apoorv Ashutosh Jan 11 '14 at 07:19
  • @ApoorvAshutosh, BTW, trailing ',' is missing in the json (in the new question). (at the line `"x": []`) => invalid json. – falsetru Jan 11 '14 at 07:27
  • sure, asap. And could you just look into one more thing, as I said, about how to read from a file with multiple dictionaries – Apoorv Ashutosh Jan 11 '14 at 07:27
  • @ApoorvAshutosh, I'm researching that issue. I will post an answer there once I'm done. – falsetru Jan 11 '14 at 07:28
  • Thats just a sample, I mentioned it in a comment – Apoorv Ashutosh Jan 11 '14 at 07:28
  • @ApoorvAshutosh, Please post a valid sample! – falsetru Jan 11 '14 at 07:29
  • @ApoorvAshutosh, No, I mean the sample in the **new** question. – falsetru Jan 11 '14 at 07:31
  • Its for this very sample, the structure of the dictionaries is basically the same. However, I'll edit that question with this very sample – Apoorv Ashutosh Jan 11 '14 at 07:32
  • @ApoorvAshutosh, I posted an answer that works around the issue. Check it out. – falsetru Jan 11 '14 at 07:49
  • Can I ask that why it still works when I use `json.dump` instead of `json.dumps`? I am using Python 3.5.2 – Aaron Liu Sep 23 '16 at 00:30
  • @ShuruiLiu, Please post a separated question. – falsetru Sep 23 '16 at 17:00
  • I have this issue with JSON from a web scrape. I ran it through a linter to check whether it is valid JSON. It seems that it is, so why would this error still occur? – Fallenreaper Jul 15 '17 at 18:56
  • I was trying with this option, but I saw another useful way to get all items : `file.readlines()` which returns a list of sentences. – Manuel Lazo Feb 04 '21 at 20:31
200

Iterate over the file, loading each line as JSON in the loop:

import json

tweets = []
with open('tweets.json', 'r') as file:
    for line in file:
        tweets.append(json.loads(line))

This avoids storing intermediate Python objects. It works as long as the file was written with exactly one complete tweet per line.
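
If the file looks like the one in the question, with several pretty-printed objects written one after another, reading it line by line will not work. One possible workaround, sketched here under that assumption, is json.JSONDecoder.raw_decode, which parses one value starting at a given position and returns the index where it stopped. The helper name `iter_json_objects` is made up for this example.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Two multi-line objects back to back, similar in shape to the question's file.
sample = '{\n  "id": 1,\n  "text": "a"\n}\n{\n  "id": 2,\n  "text": "b"\n}\n'
tweets = list(iter_json_objects(sample))
```

For a real file, pass the result of `infile.read()` as `text`. This reads the whole file into memory, which should be acceptable for a few thousand tweets.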

Kyle F Hartzenberg
Adam Hughes
  • 20
    The accepted answer addresses how to fix the source of the problem if you control the process of exporting, but if you are using someone else's data and you just have to deal with it, this is a great low-overhead method. – charlesreid1 Mar 12 '17 at 02:57
  • 5
    Many datasets (e.g. the Yelp dataset) are nowadays provided as a "set" of JSON objects, and your approach is a convenient way to load them. – Gabrer Jan 25 '18 at 00:15
  • 2
    This **only** works for inputs that have one complete JSON object **per line**. That is a common input format (it is **not** JSON, but a related format sometimes called either JSONL or NDJSON), but it is *not what is shown in the OP*. – Karl Knechtel Feb 04 '23 at 08:19
66

I came across this because I was trying to load a JSON file dumped from MongoDB. It was giving me an error

JSONDecodeError: Extra data: line 2 column 1

The MongoDB JSON dump has one object per line, so what worked for me is:

import json

# use a with-block so the file is closed once the list is built
with open('data.json', 'r') as f:
    data = [json.loads(line) for line in f]
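
A slightly more defensive variant, assuming the file really does have one object per line, skips blank lines and reports which line failed to parse. The function name `load_json_lines` is invented for this sketch.

```python
import io
import json

def load_json_lines(lines):
    """Parse an iterable of text lines, one JSON value per non-blank line."""
    result = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate empty lines, e.g. a trailing newline
        try:
            result.append(json.loads(line))
        except ValueError as err:
            # JSONDecodeError is a subclass of ValueError.
            raise ValueError("line %d: %s" % (number, err))
    return result

# io.StringIO stands in for an open file here.
data = load_json_lines(io.StringIO('{"a": 1}\n\n{"b": 2}\n'))
```

If a line holds two objects, the error still occurs, but the message now names the offending line, which can help narrow down what is wrong with the file.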
Nic Scozzaro
  • 2
    I still get `json.decoder.JSONDecodeError: Extra data: line 1 column 954 (char 953)` with this answer's code. My data file must have a different problem. – Sander Heinsalu Jan 18 '21 at 16:04
17

This may also happen if your JSON file is not just 1 JSON record. A JSON record looks like this:

[{"some data": value, "next key": "another value"}]

It opens and closes with brackets [ ], and inside the brackets are braces { }. There can be many pairs of braces, but it all ends with a closing bracket ]. If your JSON file contains more than one of those:

[{"some data": value, "next key": "another value"}]
[{"2nd record data": value, "2nd record key": "another value"}]

then loads() will fail.

I verified this with my own file that was failing.

import json

guestFile = open("1_guests.json",'r')
guestData = guestFile.read()
guestFile.close()
gdfJson = json.loads(guestData)

This works because 1_guests.json has one record []. The original file I was using, all_guests.json, had 6 records separated by newlines. I deleted 5 records (which I had already checked to be bookended by brackets) and saved the file under a new name. Then the loads statement worked.

Error was

   raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 2 column 1 - line 10 column 1 (char 261900 - 6964758)

PS. I use the word record, but that's not the official name. Also, if your file has newline characters like mine, you can loop through it to loads() one record at a time into a json variable.
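
Reading one record per line relies on the writer never putting a literal newline inside a record. With the default arguments (no `indent`), json.dumps produces a single line, because newlines inside strings are escaped. A small check, using made-up sample data:

```python
import json

record = {"text": "first line\nsecond line"}

compact = json.dumps(record)            # default: no indent
pretty = json.dumps(record, indent=4)   # pretty-printed

# The newline inside the string is escaped, so the compact form is one line.
print("\n" in compact)  # False
print("\n" in pretty)   # True
```

So a file written with one default json.dumps call per line can be read back line by line, while pretty-printed output cannot.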

VISQL
  • 3
    Is there a way to get `json.loads` to read newline-delimited json chunks? That is, to act like `[json.loads(x) for x in text.split('\n')]`? Related: Is there a guarantee that `json.dumps` will not include literal newlines in its output with default indenting? – Ben Dec 31 '15 at 13:59
  • 2
    @Ben, by default `json.dumps` will change newlines in text content to `"\n"`, keeping your json to a single line. – jchook Sep 21 '16 at 16:24
15

I just got the same error with a JSON file that looked like this:

{"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"}
{"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}

And I found it malformed, so I changed it to:

{
  "datas":[
    {"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"},
    {"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}
  ]
}
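
The conversion does not have to be done by hand. A rough sketch, assuming one object per line as in the file above, parses each line and re-serialises everything as a single array (the variable names are illustrative):

```python
import json

jsonl_text = (
    '{"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"}\n'
    '{"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}\n'
)

# Parse every non-empty line as its own JSON object.
rows = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

# Serialise the whole list once; the result is a single valid JSON document.
array_text = json.dumps(rows)

# json.loads now accepts it without "Extra data".
reloaded = json.loads(array_text)
```

Wrapping the list in a dictionary, for example `{"datas": rows}`, works the same way if a named key is preferred.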
Zoe
Akbar Noto
  • 2
    loading just like yours, json.load(infile) – Akbar Noto May 15 '19 at 09:33
  • For the record, if this is the entire JSON file, an outer map is redundant. The root [can be an array](https://stackoverflow.com/a/3833312/6296561), which lets you simplify the second JSON to just be an array. No need for a useless key in a useless map if you're storing array data - just throw it in a root array – Zoe Sep 05 '21 at 13:22
  • @Zoe oh that's interesting, could you provide us some example? – Akbar Noto Sep 06 '21 at 15:20
  • 1
    It's not exactly hard. Just wrap the two maps in an array: `[{"id":"1101010","city_id":"1101","name":"TEUPAH SELATAN"}, {"id":"1101020","city_id":"1101","name":"SIMEULUE TIMUR"}]`. Parsing is identical, access is `obj[0]`, `obj[1]`, ... (read: just like accessing a normal array), and the objects you get are identical. The one you have in your answer would require `obj["datas"][0]`, so it's functionally identical – Zoe Sep 06 '21 at 15:24
12

One-liner for your problem:

data = [json.loads(line) for line in open('tweets.json', 'r')]
Eric Aya
Nihal
  • 8
    This is not a general solution, it assumes the input has one JSON object per line, and breaks if it doesn't. – smci Jan 03 '20 at 14:16
9

If you want to solve it in a two-liner you can do it like this:

with open('data.json') as f:
    data = [json.loads(line) for line in f]
coreehi
5

I don't think saving the dicts in a list, as proposed by @falsetru, is the ideal solution here.

A better way is to iterate through the dicts and write each one to the .json file followed by a newline.

Our 2 dictionaries are

d1 = {'a':1}

d2 = {'b':2}

You can write them to a .json file, one per line:

import json
with open('sample.json','a') as sample:
    for d in [d1,d2]:  # avoid naming the loop variable `dict`, which shadows the built-in
        sample.write('{}\n'.format(json.dumps(d)))

And you can read the file back line by line without any issues:

with open('sample.json','r') as sample:
    for line in sample:
        line = json.loads(line.strip())

Simple and efficient

Zoe
murat yalçın
  • 2
    This is not a general solution, it assumes the input has one JSON object per line, and breaks if it doesn't. – smci Jan 03 '20 at 14:16
4

My JSON file was formatted exactly like the one in the question, but none of the solutions here worked. I finally found a workaround in another Stack Overflow thread. Since this post is the first result in a Google search, I am linking that answer here so that people who come to this post in the future can find it more easily.

As explained there, a valid JSON file needs a "[" at the beginning and a "]" at the end of the file. Moreover, each JSON item except the last must end with "}," instead of "}". None of these brackets are quoted! The code in the linked answer rewrites the malformed JSON file into this correct format.

https://stackoverflow.com/a/51919788/2772087

CodeLiker
4

The error is due to the \n symbol if you use the read() method of the file descriptor... so don't bypass the problem by using readlines() & co., but just remove that character!

import json

path = 'some_path'    # file containing e.g. {"c": 4}, possibly spread over several lines
path2 = 'other_path'  # placeholder output path (hypothetical name)

new_d = {'new': 5}
with open(path, 'r') as fd:
    d_old_str = fd.read().replace('\n', '') # remove all \n
    old_d = json.loads(d_old_str)

# merge old_d into new_d (the |= operator needs Python 3.9+; otherwise use new_d.update(old_d))
new_d |= old_d

with open(path2, 'w') as fd:
    fd.write(json.dumps(new_d)) # save the dictionary to file (in case needed)

... and if you really, really want to use readlines(), here is an alternative solution:

new_d = {'new': 5}
with open('some_path', 'r') as fd:
    d_old_str = ''.join(fd.readlines()) # concatenate the lines
    old_d = json.loads(d_old_str)

# then as above
cards
2

If your data is from a source outside your control, use this:

import json
from json import JSONDecodeError

def load_multi_json(line: str) -> list:
    """
    Fix some files with multiple objects on one line
    """
    try:
        return [json.loads(line)]
    except JSONDecodeError as err:
        if err.msg == 'Extra data':
            # err.pos is where the first object ends and the extra data begins
            head = [json.loads(line[0:err.pos])]
            # originally FrontFile.load_multi_json (a method); here a plain recursive call
            tail = load_multi_json(line[err.pos:])
            return head + tail
        else:
            raise
Khoa