
Summary: I am currently working on a project where I need to parse extremely large JSON files (over 10GB) in Python, and I am looking for ways to optimize the performance of my parsing code. I have tried using the json module in Python, but it is taking too long to load the entire file into memory. I am wondering if there are any alternative libraries or techniques that senior developers have used to handle such large JSON files in Python.

Explanation: I am working on a project where I need to analyze and extract data from very large JSON files. The files are too large to be loaded into memory all at once, so I need to find an efficient way to parse them. I have tried using the built-in json module in Python, but it is taking a long time to load the file into memory. I have also tried using ijson and jsonlines, but the performance is still not satisfactory. I am looking for suggestions on alternative libraries or techniques that could help me optimize my parsing code and speed up the process.

Example of the JSON:

{
  "orders": [
    {
      "order_id": "1234",
      "date": "2022-05-10",
      "total_amount": 245.50,
      "customer": {
        "name": "John Doe",
        "email": "johndoe@example.com",
        "address": {
          "street": "123 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        }
      },
      "items": [
        {
          "product_id": "6789",
          "name": "Widget",
          "price": 20.00,
          "quantity": 5
        },
        {
          "product_id": "2345",
          "name": "Gizmo",
          "price": 15.50,
          "quantity": 4
        }
      ]
    },
    {
      "order_id": "5678",
      "date": "2022-05-09",
      "total_amount": 175.00,
      "customer": {
        "name": "Jane Smith",
        "email": "janesmith@example.com",
        "address": {
          "street": "456 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "phone": "555-555-1212"
      },
      "items": [
        {
          "product_id": "9876",
          "name": "Thingamajig",
          "price": 25.00,
          "quantity": 3
        },
        {
          "product_id": "3456",
          "name": "Doodad",
          "price": 10.00,
          "quantity": 10
        }
      ]
    },
    {
      "order_id": "9012",
      "date": "2022-05-08",
      "total_amount": 150.25,
      "customer": {
        "name": "Bob Johnson",
        "email": "bjohnson@example.com",
        "address": {
          "street": "789 Main St",
          "city": "Anytown",
          "state": "CA",
          "zip": "12345"
        },
        "company": "ABC Inc."
      },
      "items": [
        {
          "product_id": "1234",
          "name": "Whatchamacallit",
          "price": 12.50,
          "quantity": 5
        },
        {
          "product_id": "5678",
          "name": "Doohickey",
          "price": 7.25,
          "quantity": 15
        }
      ]
    }
  ]
}

Version: Python 3.8

Here's what I tried:

# Attempt 1: the built-in json module, which loads the whole file into memory at once
import json

with open('large_file.json') as f:
    data = json.load(f)

# Attempt 2: ijson, which streams low-level parse events instead of building the full object
import ijson

filename = 'large_file.json'
with open(filename, 'r') as f:
    parser = ijson.parse(f)
    for prefix, event, value in parser:
        if prefix.endswith('.name'):
            print(value)

# Attempt 3: jsonlines, which expects one JSON object per line (NDJSON)
import jsonlines

filename = 'large_file.json'
with open(filename, 'r') as f:
    reader = jsonlines.Reader(f)
    for obj in reader:
        print(obj)
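
For a structure like the sample above, ijson can also yield complete objects rather than low-level parse events, which keeps only one order in memory at a time. A minimal sketch, assuming the top-level key is "orders" as in the example and using a placeholder filename:

import ijson

filename = 'large_file.json'  # placeholder path
with open(filename, 'rb') as f:
    # 'orders.item' selects each element of the top-level "orders" array
    for order in ijson.items(f, 'orders.item'):
        # `order` is a plain dict; note that ijson may represent numbers as decimal.Decimal
        total = sum(item['price'] * item['quantity'] for item in order['items'])
        print(order['order_id'], total)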
  • Without being flippant, don't start from here. JSON is hopeless at large data objects precisely for the reason you've found. Can you change the format of the datafiles you're getting? Instead of a single monolithic JSON object, have it sent as newline-delimited JSON (NDJSON), where each line is a JSON object containing a single order. If getting the file produced in this format is not possible, a simple parser should be able to handle the format change for you (a conversion sketch follows these comments). – Tangentially Perpendicular May 10 '23 at 04:17
  • @TangentiallyPerpendicular nice point. It makes a lot of sense. I'm sorry. I'm not in the perfect mental health status right now to have thought the same. – Wago Filho May 10 '23 at 04:23
  • You really should be using a database and querying that. Pay the once-off price to convert your JSON into properly normalised tables and your life will be *much* easier from then on. – Nick May 10 '23 at 04:41
  • Do you need to query or work on the same JSON file more than once? – Jarno Lamberg May 10 '23 at 08:52
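
Following the NDJSON suggestion in the first comment above, a one-off conversion could be sketched roughly as below. It streams orders with ijson so the input never has to fit in memory; the output filename is a placeholder, and default=float is used because ijson may yield decimal.Decimal for numbers, which json.dumps cannot serialize on its own.

import json
import ijson

# One-off conversion sketch: stream each order out of the monolithic file and
# write it as a single line of newline-delimited JSON (NDJSON).
with open('large_file.json', 'rb') as src, open('orders.ndjson', 'w') as dst:
    for order in ijson.items(src, 'orders.item'):
        dst.write(json.dumps(order, default=float) + '\n')  # one order per line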

1 Answer


You could try Pandas, since it can also read JSON, or even SQLite, which can parse JSON, store it in columns, and query it. I would recommend Pandas, as it is easier to use and has more documentation online. In Pandas you could do it like this:

import pandas as pd

# read_json loads the entire file into a single DataFrame
df = pd.read_json("your-filename.json")
print(df)
Xcape9797
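
If the file is first converted to newline-delimited JSON as suggested in the comments, Pandas can also read it in chunks so the whole dataset never has to be in memory at once. A rough sketch, with a placeholder filename and chunk size:

import pandas as pd

# chunksize requires lines=True (one JSON object per line); each chunk is an
# ordinary DataFrame containing up to 10,000 orders.
chunks = pd.read_json("orders.ndjson", lines=True, chunksize=10_000)
for chunk in chunks:
    print(chunk["total_amount"].sum())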