
I am trying to build an inverted index over the documents of the Cranfield Collection, which I have as a .json file. An excerpt of the file is shown below. The real file contains 1400 of these entries; only the first and the last are shown here.

{
  "add" : {
    "doc" : {
      "id" : 1,
      "author" : "brenckman,m.",
      "bibliography" : "j. ae. scs. 25, 1958, 324.",
      "body" : "a lot of text.",
      "title" : "title 1."
    }
  },
  "add" : {
    "doc" : {
      "id" : 1400,
      "author" : "kleeman,p.w.",
      "bibliography" : "arc r + m.2971, 1953.",
      "body" : "a lot of text.",
      "title" : "title 2."
    }
  },
  "commit" : { }
}

However, I cannot even read the .json file properly, let alone start building the index. When I run the code below, it prints only the last entry of the file plus "commit": {}, that is, everything from the second "add" in the example above onwards.

With 1400 entries in the file, I don't understand why I only get the last one. I also checked print(len(data)), which returns 2 where I expected 1400. My code is given below. Any help would be appreciated.

import json
from pprint import pprint

with open("cranfield-data.json", encoding="utf-8") as data_file:
    # Parse the whole file into a Python object
    data = json.load(data_file)

pprint(data)
refnet
  • @pault, I think I realise now that that is what I want, yes. So every "add" element should be the start of an array? Should every "doc" and all the other elements also be their own arrays? – refnet Mar 20 '19 at 18:03
  • Possible duplicate of [Python json parser allow duplicate keys](https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys) – pault Mar 20 '19 at 18:48
  • Possible duplicate of [Python json parser allow duplicate keys](https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys) – Popo Mar 20 '19 at 19:01

2 Answers


The problem is that your JSON object repeats the same key, "add", many times, and Python's json module keeps only the last value it sees for a repeated key. One solution is to pass a custom object_pairs_hook to json.load (or to JSONDecoder), as explained in this post:

[Python json parser allow duplicate keys](https://stackoverflow.com/questions/29321677/python-json-parser-allow-duplicate-keys)
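As a minimal sketch of that approach (the helper name collect_docs and the shortened sample string are made up for illustration, not taken from the linked post), the hook below gathers every value of a repeated "add" key into a list instead of letting the last one win. It assumes "add" only appears as a key at the top level of the file.

```python
import json


def collect_docs(pairs):
    """object_pairs_hook: keep every value of a repeated "add" key in a list."""
    result = {}
    for key, value in pairs:
        if key == "add":
            # Append instead of overwriting, so no entry is lost
            result.setdefault("add", []).append(value)
        else:
            result[key] = value
    return result


# Shortened stand-in for cranfield-data.json (hypothetical sample)
sample = """
{
  "add": {"doc": {"id": 1, "title": "title 1."}},
  "add": {"doc": {"id": 1400, "title": "title 2."}},
  "commit": {}
}
"""

data = json.loads(sample, object_pairs_hook=collect_docs)
docs = [entry["doc"] for entry in data["add"]]
print(len(docs))
print([doc["id"] for doc in docs])
```

For the real file, the same hook can be passed as json.load(data_file, object_pairs_hook=collect_docs); len(docs) should then match the number of "add" entries in the file.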

Farhood ET

Your JSON is malformed. A JSON object, like the Python dictionary it maps to, can only have one item for each key. You have used the same key, "add", each time.

You probably need an array of objects, not a single object.
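To illustrate what an array of objects could look like, here is a small sketch. It assumes you are free to rewrite the file; the flattened layout (one object per document, without the "add" and "doc" wrappers) is only one possible choice, and the sample string is a shortened stand-in for the real data.

```python
import json

# Hypothetical restructured file: a JSON array with one object per document
sample = """
[
  {"id": 1, "author": "brenckman,m.", "title": "title 1."},
  {"id": 1400, "author": "kleeman,p.w.", "title": "title 2."}
]
"""

docs = json.loads(sample)  # a JSON array is parsed into a Python list
print(len(docs))

# Index the documents by id for direct lookup
by_id = {doc["id"]: doc for doc in docs}
print(by_id[1400]["author"])
```

With this layout, len(docs) counts documents rather than top-level keys, and no entry is overwritten by a later one.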

Daniel Roseman
  • Ok, I see. That's probably obvious, but I am quite new to Python and JSON. I was given this .json file by my instructor, so I just assumed it was ready to use. So if I understand correctly, every "add" element needs to be the start of a new array? – refnet Mar 20 '19 at 18:00
  • The keys in a JSON object are not required to be unique, but neither is a JSON decoder required to preserve duplicate keys. The semantics of duplicate keys in a JSON object just aren't defined. – chepner Mar 20 '19 at 18:01
  • @chepner ok, so does that mean that my initial code ought to work? Do you have any idea why I am not getting all 1400 objects from the .json file? – refnet Mar 20 '19 at 18:05
  • JSON may not demand it, but Python does. The whole point of a dictionary is that you can refer to its keys uniquely by their name. – Daniel Roseman Mar 20 '19 at 18:06
  • All right. So I basically have 1400 instances of the same key "add", but it should be a list with each of these dictionaries? – refnet Mar 20 '19 at 18:09