Question
I have a text file that records metadata of research papers requested with SemanticScholar API. However, when I wrote requested data, I forgot to add "\n"
for each individual record. This results in something looks like
{<metadata1>}{<metadata2>}{<metadata3>}...
and this should be if I did add "\n"
.
{<metadata1>}
{<metadata2>}
{<metadata3>}
...
Now, I would like to read the data. As all the metadata is now stored in one line, I need to do some hacks
- First I split the cluttered dicts using
"{"
. - Then I tried to convert the string
line
back to dict. Note that I do considerline
might not be in a proper JSON format.
import json
with open("metadata.json", "r") as f:
for line in f.readline().split("{"):
print(json.loads("{" + line.replace("\'", "\"")))
However, there is still an error message
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
I am wondering what should I do to recover all the metadata I collected?
MWE
Note, in order to get metadata.json
file I use, use the following code, it should work out of the box.
import json
import urllib
import requests
baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
"b407a81019650fe8b0acf7e4f8f18451f9c803d5",
"ff118a6a74d1e522f147a9aaf0df5877fd66e377"]
for paperID in paperIDList:
response = requests.get(urllib.parse.urljoin(baseURL, paperID))
metadata = response.json()
record = dict()
record["title"] = metadata["title"]
record["abstract"] = metadata["abstract"]
record["paperId"] = metadata["paperId"]
record["year"] = metadata["year"]
record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
with open("metadata.json", "a") as fileObject:
fileObject.write(json.dumps(record))