
I have two large files:

  • one is a text file with a lot of IDs: one ID per row;
  • the other one is a 6+ GB JSON file containing many items.

I need to search for those IDs in a certain field of the JSON file and copy each whole item they refer to into a new file for later analysis.

Here is an example:

IDs.txt

    unique_id_1
    unique_id_2
    ...

schema.json

[
    {
        "id": "unique_id_1",
        "name": "",
        "text": "",
        "date": "",
    },
    {
        "id": "unique_id_aaa",
        "name": "",
        "text": "",
        "date": "",
    },
    {
        "id": "unique_id_2",
        "name": "",
        "text": "",
        "date": "",
    },
    ...
]

I am doing this analysis with Python and pandas, but I am running into trouble because of the large size of the files. What is the best way to do this? I can also consider using other software / languages.

lasko
    You could turn the `.json` file into a `.csv` once (hoping that operation fits in memory), then you have a file that is much easier to [process in chunks](https://stackoverflow.com/a/519653/2237151), with the entire information about each ID in a single line. If the file is too big for the conversion, splitting it is [not trivial](https://stackoverflow.com/a/10238541/2237151), but doable (and again, only needed once). – Pietro Mar 27 '21 at 09:35
  • Or, if you know the schema beforehand, you could parse the `json` file line by line, building each object as you read, and discarding it if the `id` does not match. Not super elegant, but it would work with less hassle. – Pietro Mar 27 '21 at 09:39
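
For reference, a minimal sketch of the chunked-CSV route from the first comment, assuming the one-off conversion fits in memory (the file names, the chunk size, and the `filtered.json` output are assumptions, not part of the question):

    import pandas as pd

    # one-off conversion; this step still needs enough RAM for the whole JSON
    pd.read_json("schema.json").to_csv("schema.csv", index=False)

    # load the IDs into a set for fast membership tests
    with open("IDs.txt") as f:
        useful_ids = {line.strip() for line in f if line.strip()}

    # stream the CSV in chunks and keep only rows whose "id" is in the set
    matches = []
    for chunk in pd.read_csv("schema.csv", chunksize=100_000):
        matches.append(chunk[chunk["id"].isin(useful_ids)])

    # write the matching items to a new file for later analysis
    pd.concat(matches).to_json("filtered.json", orient="records", indent=4)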

1 Answer


I implemented my second suggestion: this only works if the schema is flat (there are no nested objects in the JSON file). I also did not check what happens if a value in the JSON file is itself a dictionary; that case would probably need to be handled more carefully, as I currently check for a `}` at the start of a line to decide whether the object is over.

You still need to load the entire IDs file, as you need some way to check whether each object is needed.

If the useful_objects list grows too large, you can easily save it periodically while parsing the file (a rough sketch of that is shown after the script).

import json
from pathlib import Path
import re
from typing import Dict

schema_name = "schema.json"
schema_path = Path(schema_name)
ids_name = "IDs.txt"
ids_path = Path(ids_name)

# read the ids
useful_ids = set()
with ids_path.open() as id_f:
    for line in id_f:
        id_ = line.strip()
        useful_ids.add(id_)
print(useful_ids)

useful_objects = []
temp: Dict[str, str] = {}
was_useful = False

with schema_path.open() as sc_f:

    for line in sc_f:
        # remove start/end whitespace
        line = line.strip()
        # skip blank lines
        if not line:
            continue
        print(f"Parsing line {line}")

        # an object is ending
        if line.startswith("}"):
            # add it
            if was_useful:
                useful_objects.append(temp)
            # reset the usefulness for the next object
            was_useful = False
            # reset the temp object
            temp = {}

        # parse the line
        match = re.match(r'"(.*?)": "(.*)"', line)

        # if this did not match, skip the line
        if match is None:
            continue

        # extract the data from the regex match
        key = match.group(1)
        value = match.group(2)
        print(f"\tMatched: {key} {value}")

        # build the temp object incrementally
        temp[key] = value

        # check if this object is useful
        if key == "id" and value in useful_ids:
            was_useful = True

useful_json = json.dumps(useful_objects, indent=4)
print(useful_json)
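
If you go the periodic-save route, a rough sketch could look like this (the flush threshold, the part counter, and the output file names are made up):

    import json
    from pathlib import Path

    # hypothetical helper: dump the matches collected so far to a numbered
    # file and return an empty list so the parsing loop keeps memory low
    def flush(objects, part):
        Path(f"useful_part_{part}.json").write_text(json.dumps(objects, indent=4))
        return []

    # inside the parsing loop, right after useful_objects.append(temp):
    #     if len(useful_objects) >= 10_000:
    #         useful_objects = flush(useful_objects, part)
    #         part += 1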

Again, not very elegant and not very robust, but as long as you are aware of the limitations, it does the job.

Cheers!

Pietro
  • Yes, my schema is flat and your script works perfectly. The file sizes are still large but much more manageable. It could surely be made more robust and efficient for nested JSON as well, but for my purpose I confirm that it works great. Really thank you very much! – lasko Mar 27 '21 at 13:58