I am trying to build an inverted index over the documents of the Cranfield Collection, which I have as a .json file. Below is an excerpt of what the file contains. In reality there are 1400 of these "add" entries; shown here are the first and the last.
    {
      "add" : {
        "doc" : {
          "id" : 1,
          "author" : "brenckman,m.",
          "bibliography" : "j. ae. scs. 25, 1958, 324.",
          "body" : "a lot of text.",
          "title" : "title 1."
        }
      },
      "add" : {
        "doc" : {
          "id" : 1400,
          "author" : "kleeman,p.w.",
          "bibliography" : "arc r + m.2971, 1953.",
          "body" : "a lot of text.",
          "title" : "title 2."
        }
      },
      "commit" : { }
    }
However, I am not even able to read the .json file properly, so I cannot start building the inverted index. When I run the code given below, it only prints the last object of the .json file together with "commit": {}, i.e. everything starting from the second "add" in my example above.
Considering there are 1400 objects, I don't understand why I only get the last one. My code is given below. I have also checked print(len(data)), which returns 2 where I expect 1400. Any help would be appreciated.
import json
from pprint import pprint
with open("cranfield-data.json", encoding="utf-8") as data_file:
    data = json.loads(data_file.read())
pprint(data)
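A minimal sketch that reproduces the symptom on a small inline string instead of the real file, assuming the repeated "add" keys are the cause: Python's json module keeps only the last value when an object contains the same key more than once, which would explain a length of 2 ("add" and "commit"). The names `sample`, `pairs_or_dict` and `docs` are made up for illustration. The second half shows one possible workaround using the `object_pairs_hook` parameter of `json.loads`, which receives every object as a list of (key, value) pairs before any key is overwritten.

```python
import json

# Small stand-in for the real file: two "add" entries plus "commit".
sample = """
{
  "add": {"doc": {"id": 1, "title": "title 1."}},
  "add": {"doc": {"id": 1400, "title": "title 2."}},
  "commit": {}
}
"""

# Default behaviour: the later "add" overwrites the earlier one.
plain = json.loads(sample)
print(len(plain))                  # number of distinct top-level keys
print(plain["add"]["doc"]["id"])   # id of the surviving "add" entry


def pairs_or_dict(pairs):
    """Hypothetical hook: keep objects with repeated keys as a list of pairs."""
    keys = [key for key, _ in pairs]
    if len(keys) == len(set(keys)):
        return dict(pairs)   # no duplicates: behave like a normal dict
    return pairs             # duplicates: keep every (key, value) pair


# With the hook, the top-level object comes back as a list of pairs,
# so every "add" entry is still there.
data = json.loads(sample, object_pairs_hook=pairs_or_dict)
docs = [value["doc"] for key, value in data if key == "add"]
print([doc["id"] for doc in docs])
```

On the real file the same hook could be passed to `json.load(data_file, object_pairs_hook=pairs_or_dict)`. Note that this sketch returns a plain dict whenever an object has no repeated keys, so a file with a single "add" entry would need separate handling.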