Reading a text file of dictionaries stored in one line

Question

I have a text file that records the metadata of research papers requested through the Semantic Scholar API. However, when I wrote the requested data, I forgot to add "\n" after each individual record. The result looks like this:

{<metadata1>}{<metadata2>}{<metadata3>}...

whereas, had I added "\n", it would look like this:

{<metadata1>}
{<metadata2>}
{<metadata3>}
...

Now I would like to read the data. Since all the metadata is stored on one line, I need a workaround:

First, I split the concatenated dicts on "{".
Then I try to convert each piece back to a dict. Note that I allow for the possibility that a piece is not in proper JSON format.

import json
with open("metadata.json", "r") as f:
    for line in f.readline().split("{"):
        print(json.loads("{" + line.replace("\'", "\"")))

However, I still get an error message:

JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

What should I do to recover all the metadata I collected?

MWE

To generate the metadata.json file I use, run the following code; it should work out of the box.

import json
import urllib.parse
import requests

baseURL = "https://api.semanticscholar.org/v1/paper/"
paperIDList = ["200794f9b353c1fe3b45c6b57e8ad954944b1e69",
               "b407a81019650fe8b0acf7e4f8f18451f9c803d5",
               "ff118a6a74d1e522f147a9aaf0df5877fd66e377"]

for paperID in paperIDList:
    response = requests.get(urllib.parse.urljoin(baseURL, paperID))
    metadata = response.json()
    record = dict()
    record["title"] = metadata["title"]
    record["abstract"] = metadata["abstract"]
    record["paperId"] = metadata["paperId"]
    record["year"] = metadata["year"]
    record["citations"] = [item["paperId"] for item in metadata["citations"] if item["paperId"]]
    record["references"] = [item["paperId"] for item in metadata["references"] if item["paperId"]]
    with open("metadata.json", "a") as fileObject:
        fileObject.write(json.dumps(record))

I suggest using pickle.dump() and pickle.load() instead of JSON for manipulating the file – Leonardo Scotti, Nov 19 '20 at 16:46
Why not re-create the metadata.json file with the `\n` at the end of each record? — vighnesh153, Nov 19 '20 at 16:47
@adirabargil I have already provided the script to get `metadata.json` in MWE section. — Mr.Robot, Nov 19 '20 at 17:27
@LeonardoScotti Yes. This is a good solution for smaller files. But in my case, there are almost half a million records and directly using `pickle.dump()` is problematic. — Mr.Robot, Nov 19 '20 at 17:34
@VighneshRaut Thank you for pointing this out! As is pointed out in the accepted answer, I think this is a much better solution than what I am trying to do. — Mr.Robot, Nov 19 '20 at 17:39

wagnifico · Accepted Answer · 2020-11-19T17:38:35.130

1

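The problem is that split("{") returns an empty first item, corresponding to the text before the opening {. Passing that item to json.loads means parsing just "{", which raises the error you see. Ignore the first element and the records parse. I kept your quote replacement but wrote it with raw string literals and applied it to the text before parsing rather than to the parsed result; since json.dumps already writes double-quoted JSON, it should not change anything for this file. Note that this approach only works as long as no title or abstract itself contains a {:

import json

with open("metadata.json", "r") as f:
    # [1:] drops the empty string that precedes the first "{"
    for line in f.readline().split("{")[1:]:
        print(json.loads("{" + line.replace(r"\'", r"\"")))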
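
To see where the empty first item comes from, here is a minimal sketch on a made-up two-record string. The sample data and variable names are hypothetical and are not taken from the real file:

```python
import json

# Hypothetical stand-in for the one-line content of metadata.json.
raw = '{"paperId": "a1", "year": 2019}{"paperId": "b2", "year": 2020}'

parts = raw.split("{")
# The text before the first "{" is the empty string.
print(repr(parts[0]))

# Skipping that empty item and restoring the brace yields valid JSON again.
records = [json.loads("{" + part) for part in parts[1:]]
print(records)
```

The same slicing would fail on a record whose text contains a "{" of its own, because split cannot tell a brace inside a string from one that starts a record.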

As suggested in the comments, I would actually recommend recreating the file, or saving a new version in which every }{ is replaced with }\n{:

with open("metadata.json", "r") as f:
    data = f.read()
data_lines = data.replace("}{","}\n{")
with open("metadata_mod.json", "w") as f:
    f.write(data_lines)

That way you will have one paper's metadata per line, as you wanted.
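
Both approaches above rely on "{" or "}{" never appearing inside a title or abstract. If that cannot be guaranteed, one possible alternative (a sketch that is not part of the original answer) is to let the standard library's json.JSONDecoder.raw_decode find where each record ends. The helper name iter_concatenated_json and the sample string are hypothetical:

```python
import json


def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Made-up sample whose string values contain braces and even "}{".
sample = '{"title": "Sets {A} and {B}", "year": 2020}{"title": "x}{y", "year": 2019}'
records = list(iter_concatenated_json(sample))
print(len(records))
```

Each yielded record could then be written back with json.dumps(record) + "\n" to produce a one-record-per-line file. Like the snippet above, this assumes the whole file fits in memory as a single string.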

edited Nov 19 '20 at 17:38

answered Nov 19 '20 at 17:28


Thank you! I think recreating the entire file is a great solution. – Mr.Robot Nov 19 '20 at 17:38
You are welcome. I just edited the answer with a correction. You don't need to remove the replace, just use the proper syntax. – wagnifico Nov 19 '20 at 17:39
