Python – not retrieving all json objects, only the last

Question

I am trying to build an inverted index from the documents of the Cranfield Collection, which I have as a .json file. Below is an excerpt of that file. In reality it contains 1400 of these entries; only the first and the last are shown here.

{
  "add" : {
    "doc" : {
      "id" : 1,
      "author" : "brenckman,m.",
      "bibliography" : "j. ae. scs. 25, 1958, 324.",
      "body" : "a lot of text.",
      "title" : "title 1."
    }
  },
  "add" : {
    "doc" : {
      "id" : 1400,
      "author" : "kleeman,p.w.",
      "bibliography" : "arc r + m.2971, 1953.",
      "body" : "a lot of text.",
      "title" : "title 2."
    }
  },
  "commit" : { }
}

However, I am not even able to read the .json file properly so that I can start building the inverted index. When I run the code below, it only prints the last object of the .json file together with "commit": {}, that is, everything from the second "add" in my example above onwards.

Since there are 1400 objects, I don't understand why I only get the last one. I also checked with print(len(data)), which returns 2 where I expected 1400. My code is given below. Any help would be appreciated.

import json
from pprint import pprint

with open("cranfield-data.json", encoding="utf-8") as data_file:
    data = json.load(data_file)

pprint(data)

@pault, I think I have realised now that that is what I want, yes. So every "add" element should be the start of an array? Should every "doc" and all the other elements also be their own array? – refnet, Mar 20 '19 at 18:03
Possible duplicate of [Python json parser allow duplicate keys](https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys) — pault, Mar 20 '19 at 18:48
Possible duplicate of [Python json parser allow duplicate keys](https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys) — Popo, Mar 20 '19 at 19:01

score 1 · Accepted Answer · answered Mar 20 '19 at 18:08

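The problem you're currently having is that your JSON object uses the same key, "add", many times. Python's json module keeps only the last value it sees for a repeated key, which is why you end up with a single "add" entry. The solution is to pass a custom object_pairs_hook to json.load (or json.loads), as was explained before in this post: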

Python json parser allow duplicate keys
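As a rough illustration of the object_pairs_hook approach, here is a minimal sketch. The hook name keep_duplicates and the shortened sample string are made up for this example and are not taken from the linked post; the idea is only that the hook receives every key/value pair of an object, including repeated keys, and can decide what to do with them.

```python
import json


def keep_duplicates(pairs):
    """Hypothetical hook: collect the values of a repeated key into a list."""
    result = {}
    repeated = set()  # keys that have already been turned into lists
    for key, value in pairs:
        if key not in result:
            result[key] = value
        elif key in repeated:
            result[key].append(value)
        else:
            # Second occurrence of this key: start a list with both values.
            result[key] = [result[key], value]
            repeated.add(key)
    return result


# Shortened stand-in for the real file, with the same duplicated "add" key.
sample = """
{
  "add": {"doc": {"id": 1, "title": "title 1."}},
  "add": {"doc": {"id": 1400, "title": "title 2."}},
  "commit": {}
}
"""

# Default behaviour: the last "add" silently replaces the earlier one.
plain = json.loads(sample)
print(len(plain), plain["add"]["doc"]["id"])

# With the hook, both "add" entries are kept.
data = json.loads(sample, object_pairs_hook=keep_duplicates)
docs = [entry["doc"] for entry in data["add"]]
print(len(docs), [doc["id"] for doc in docs])
```

For the real file the same hook could be passed to json.load(data_file, object_pairs_hook=keep_duplicates). One caveat of this sketch: a key that occurs only once keeps its plain value, so data["add"] would be a dict rather than a list if the file held a single document.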

Farhood ET

1,432
15
32

@refnet no problem, but if your problem was solved using this, it's better to check it as the main answer for people who might have the same problem in the future. – Farhood ET Mar 20 '19 at 18:12
If this is the answer, we should close the question as a duplicate. – chepner Mar 20 '19 at 18:37
@chepner yeah probably better to archive it as a duplicate. – Farhood ET Mar 20 '19 at 18:56

score 0 · Answer 2 · answered Mar 20 '19 at 17:55

Your JSON is malformed. A JSON object, like the Python dictionary it maps to, can only have one item for each key. You have used the same key, "add", each time.

You probably need an array of objects, not a single object.
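To make the "array of objects" suggestion concrete, here is one possible layout, assuming the file can be regenerated or edited. The exact structure (a single "add" key holding a list) is an assumption for illustration, not something the original file or its provider prescribes; the field values are copied from the excerpt in the question.

```python
import json

# Hypothetical restructured file content: "add" appears once and holds a list.
restructured = """
{
  "add": [
    {"doc": {"id": 1, "author": "brenckman,m.", "title": "title 1."}},
    {"doc": {"id": 1400, "author": "kleeman,p.w.", "title": "title 2."}}
  ],
  "commit": {}
}
"""

data = json.loads(restructured)

# Every document survives parsing because no key is repeated.
docs = [entry["doc"] for entry in data["add"]]
print(len(docs))
for doc in docs:
    print(doc["id"], doc["author"], doc["title"])
```

With this shape the standard parser needs no extra arguments, and len(data["add"]) gives the number of documents.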

Daniel Roseman

588,541
66
880
895

OK, I see. That's probably obvious, but I am quite new to Python and JSON. I was given this .json file by my instructor, so I just assumed it was ready to use. So if I understand correctly, every "add" element needs to be the start of a new array? – refnet Mar 20 '19 at 18:00
The keys in a JSON object are not required to be unique, but neither is a JSON decoder required to preserve duplicate keys. The semantics of a JSON object just aren't defined. – chepner Mar 20 '19 at 18:01
@chepner ok, so does that mean that my initial code ought to work? Do you have any idea why I am not returning all 1400 objects of the .json-file? – refnet Mar 20 '19 at 18:05
JSON may not demand it, but Python does. The whole point of a dictionary is that you can refer to its keys uniquely by their name. – Daniel Roseman Mar 20 '19 at 18:06
All right. So I basically have 1400 instances of the same key "add", but it should be a list with each of these dictionaries? – refnet Mar 20 '19 at 18:09
