I implemented my second suggestion. It only works if the schema is flat, i.e. there are no nested objects in the JSON file. I also did not check what happens if a value in the JSON file is a dictionary; that case would need more careful handling, because I currently look for } at the start of a line to decide that an object has ended.
You still need to load the entire IDs file, since you need some way to check whether an object is needed.
If the useful_objects list grows too large, you can easily save it to disk periodically while parsing the file.

    import json
    import re
    from pathlib import Path
    from typing import Dict

    schema_name = "schema.json"
    schema_path = Path(schema_name)
    ids_name = "IDs.txt"
    ids_path = Path(ids_name)

    # read the ids
    useful_ids = set()
    with ids_path.open() as id_f:
        for line in id_f:
            id_ = line.strip()
            useful_ids.add(id_)
    print(useful_ids)

    useful_objects = []
    temp: Dict[str, str] = {}
    was_useful = False
    with schema_path.open() as sc_f:
        for line in sc_f:
            # remove start/end whitespace
            line = line.strip()
            print(f"Parsing line {line}")
            # an object is ending (startswith is safe on empty lines,
            # where line[0] would raise an IndexError)
            if line.startswith("}"):
                # add it
                if was_useful:
                    useful_objects.append(temp)
                # reset the usefulness for the next object
                was_useful = False
                # reset the temp object
                temp = {}
                continue
            # parse the line
            match = re.match(r'"(.*?)": "(.*)"', line)
            # if this did not match, skip the line
            if match is None:
                continue
            # extract the data from the regex match
            key = match.group(1)
            value = match.group(2)
            print(f"\tMatched: {key} {value}")
            # build the temp object incrementally
            temp[key] = value
            # check if this object is useful
            if key == "id" and value in useful_ids:
                was_useful = True

    useful_json = json.dumps(useful_objects, indent=4)
    print(useful_json)

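The periodic saving mentioned above could look roughly like this. It is only a sketch: the helper names `flush_objects` and `save_in_batches`, the `batch_size` parameter, and the choice of one JSON object per output line (instead of a single indented JSON array) are my assumptions, not part of the script above.

```python
import json
from typing import Dict, Iterable, List, TextIO


def flush_objects(buffer: List[Dict[str, str]], out_f: TextIO) -> None:
    """Write each buffered object as one JSON line, then empty the buffer."""
    for obj in buffer:
        out_f.write(json.dumps(obj) + "\n")
    buffer.clear()


def save_in_batches(
    objects: Iterable[Dict[str, str]], out_f: TextIO, batch_size: int = 1000
) -> int:
    """Write objects to out_f while keeping at most batch_size in memory.

    Returns the number of objects written.
    """
    buffer: List[Dict[str, str]] = []
    written = 0
    for obj in objects:
        buffer.append(obj)
        if len(buffer) >= batch_size:
            written += len(buffer)
            flush_objects(buffer, out_f)
    # write whatever is left after the last full batch
    written += len(buffer)
    flush_objects(buffer, out_f)
    return written
```

In the parsing loop you would append to a small buffer instead of useful_objects and call `flush_objects` whenever the buffer reaches the chosen size, with the output file opened once before the loop. Writing one object per line means the output can be appended to without rewriting what is already on disk.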
Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.
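If the flat-schema limitation becomes a problem, one possible alternative (not what the script above does) is to count braces and hand each complete object to `json.loads`. This is a rough sketch under several assumptions: the file is a pretty-printed JSON array of objects, each top-level object starts on a line beginning with {, and no string value contains { or }. The name `iter_objects` is made up for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_objects(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield each top-level object, one at a time, from an iterable of lines."""
    depth = 0
    chunk: List[str] = []
    for line in lines:
        stripped = line.strip()
        # outside an object: skip "[", "]" and blank lines
        if depth == 0 and not stripped.startswith("{"):
            continue
        chunk.append(stripped)
        # naive brace counting; braces inside string values would break this
        depth += stripped.count("{") - stripped.count("}")
        if depth == 0:
            # drop the comma that separates array elements, then parse
            yield json.loads("".join(chunk).rstrip(","))
            chunk = []
```

You could then loop over `iter_objects(sc_f)` and keep an object only when `obj.get("id")` is in useful_ids. Only one object is held in memory at a time, and nested dictionaries are parsed properly.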
Cheers!