
What is the most efficient way to load a large, pretty-printed JSON file with a billion lines into a pandas DataFrame?

Example data from the file (for brevity, only two objects out of millions are shown here):

[
{
"TYPE":"DCF3",
"UID":{
"Number":"ABRRRT1",
"Date":"21.01.2022"
},
"UIDC": "0001"
}
,
{
"TYPE":"DCF3",
"UID":{ 
"Number":"ABRRRT12",
"Date":"22.01.2022"
},
"UIDC": "0001"
}
]

The desired outcome as a Pandas DataFrame:

   TYPE                                           UID  UIDC
0  DCF3   {'Number': 'ABRRRT1', 'Date': '21.01.2022'}  0001
1  DCF3  {'Number': 'ABRRRT12', 'Date': '22.01.2022'}  0001

I tried several approaches on my laptop with 16 GB RAM:

  • Pandas read_json() - RAM runs out and the kernel dies; with lines=True it throws ValueError: Expected object or value
  • Python json.load() - RAM runs out and the kernel dies.
  • Pyspark spark.read.json() - df.printSchema() outputs _corrupt_record: string (nullable = true)
  • Splitting the file into several smaller files on the character combination {\n and loading the resulting JSON into pandas separately - in general, this worked for me.

Software versions:

  • python: 3.8.0
  • IPython: 8.12.0
  • jupyter-notebook: 6.5.4
  • pandas: 2.0.2
  • pyspark: 3.0.1
MaxFall
  • Is there something specific you will be doing with this data? Does it need to all be loaded at once? You can convert it into ndjson/jsonlines using a streaming json parser e.g. `json-streamer`, `ijson` but creating a single pandas dataframe would still fail. Depending on what you need to do you could then process in chunks, or use something that supports larger than ram datasets, pyspark, polars, duckdb, dask etc. – jqurious Jun 03 '23 at 17:04
  • Mainly I would like to be able to apply to the data, for example, counting values of a certain type in a column and other simple manipulations, and it would be more convenient if all the data were loaded at once. Thanks for the help. – MaxFall Jun 03 '23 at 18:17
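
As jqurious suggests above, one workaround is to stream the array with a streaming parser and rewrite it as newline-delimited JSON (NDJSON), which pandas can then read in chunks with lines=True. Below is a minimal sketch of that idea using ijson; the filenames big.json and big.ndjson are placeholders, and counting values in the TYPE column stands in for whatever chunk-wise processing you need:

import json

import ijson
import pandas as pd

# Stream the huge array and rewrite it as NDJSON (one JSON object per line),
# keeping only a single object in memory at a time.
with open('big.json', 'rb') as src, open('big.ndjson', 'w') as dst:
    for obj in ijson.items(src, 'item'):
        # default=str guards against values json.dumps can't serialize, such as
        # the Decimal objects ijson produces for numbers (the sample data here
        # is all strings, so nothing is changed).
        dst.write(json.dumps(obj, default=str) + '\n')

# The NDJSON file can then be read lazily in chunks, e.g. to count values
# in a column without ever holding the whole dataset in memory.
counts = pd.Series(dtype='int64')
for chunk in pd.read_json('big.ndjson', lines=True, chunksize=100_000):
    counts = counts.add(chunk['TYPE'].value_counts(), fill_value=0)
print(counts)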

1 Answer

Iterative JSON parsing

The tool I recommend for parsing huge JSON files is ijson.

Here's an example of how to use ijson to read a JSON list.

import ijson

# replace this with the filename of your file
with open('test364.json') as f:
    items = ijson.items(f, 'item')
    for item in items:
        print(item)

The advantage of ijson is that it doesn't read the entire file in at once: it keeps only one element in memory at a time.

What you can do is read in a batch of items at a time, convert each batch to a DataFrame to save memory, and keep a list of DataFrames that you combine in a final step.

Here's what that might look like:

import itertools
import pandas as pd
import ijson


def batched(iterable, n):
    # https://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
    "Batch data into tuples of length n. The last batch may be shorter."
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch


def read_json_chunked(filename, chunksize):
    """Reads in JSON from filename and yields batches of items."""
    with open(filename) as f:
        items = ijson.items(f, 'item')
        yield from batched(items, chunksize)

df_list = []
chunksize = 3  # Change this to a higher value for speed
for chunk in read_json_chunked('test364.json', chunksize):
    df_list.append(pd.DataFrame(chunk))
df_final = pd.concat(df_list, ignore_index=True)
print(df_final)

Note: above I use a chunk size of 3 for testing. You should use a much larger chunksize - around 10,000 to 100,000.

Memory-Efficient Dtypes

The above DataFrame is fairly inefficient in how it uses memory. You can find how much memory it uses for each row with this expression:

>>> df_final.memory_usage(deep=True).sum() / len(df_final)
346.64

There are several things you can do to reduce memory usage. Each of the code examples below runs on the individual chunks, because if you wait until you call pd.concat(), you've already run out of memory.

  • Split UID column. The UID column takes the largest amount of memory, because for each row, it must store four strings and one dictionary object. You're only really interested in two of those strings. I'd suggest using pd.json_normalize():

    for chunk in read_json_chunked('test364.json', chunksize):
        df_chunk = pd.json_normalize(chunk)
        df_list.append(df_chunk)
    

    This produces a slightly different dataframe, with UID split into two columns.

         TYPE  UIDC UID.Number    UID.Date
    0    DCF3  0001    ABRRRT1  21.01.2022
    1    DCF3  0001   ABRRRT12  22.01.2022
    2    DCF3  0001    ABRRRT1  21.01.2022
    3    DCF3  0001   ABRRRT12  22.01.2022
    4    DCF3  0001    ABRRRT1  21.01.2022
    ..    ...   ...        ...         ...
    195  DCF3  0001   ABRRRT12  22.01.2022
    196  DCF3  0001    ABRRRT1  21.01.2022
    197  DCF3  0001   ABRRRT12  22.01.2022
    198  DCF3  0001    ABRRRT1  21.01.2022
    199  DCF3  0001   ABRRRT12  22.01.2022
    
    [200 rows x 4 columns]
    

    Warning: this assumes that the keys within each dictionary are mostly the same. If every value of UID contains different keys, this could end up using more memory.

    (See table below for how much memory this saves.)

  • Convert UIDC to integer. Strings in Python have significant overhead.

    df_chunk['UIDC'] = df_chunk['UIDC'].astype('int32')
    

    Warning: this makes the assumption that all UIDC values are digits, and that the leading zeros of UIDC don't matter.

  • Convert UID.Date to datetime. Same argument as before - datetime is much smaller because it is only 8 bytes per date.

    df_chunk['UID.Date'] = pd.to_datetime(df_chunk['UID.Date'], format='%d.%m.%Y')
    
  • Convert TYPE to categorical. Assuming the TYPE column mostly contains duplicate values (more than 50%), you can save memory with a categorical dtype. This stores the list of unique values in the column once, then represents each value as an integer index into that list.

    df_chunk['TYPE'] = df_chunk['TYPE'].astype('category')
    
  • Convert UID.Number to a PyArrow-backed string. Since this column contains non-numeric values, the integer trick won't work, and since it is (I assume) different for each row, the categorical trick won't work either.

    However, you can still reduce the memory usage by using the string dtype, with the storage type set to pyarrow. This reduces the string overhead from ~40 bytes to ~4 bytes.

    df_chunk['UID.Number'] = df_chunk['UID.Number'].astype('string[pyarrow]')
    

    You can read more about the string dtype in the pandas documentation on text data types.

Combined, these tricks can reduce the memory usage of the DataFrame by 10x.

Method                          Bytes Per Row
Original                               346.64
Split UID column                       254.14
UIDC to integer                        197.14
UID.Date to datetime64                 138.14
TYPE converted to categorical           78.99
UID.Number to pyarrow string            25.99

(Each row includes all optimizations from previous rows. All measurements were done on a 200 row DataFrame.)
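
Putting it all together, here is a minimal sketch that reuses read_json_chunked from above and applies all of these conversions to each chunk before concatenating. The helper name optimize_chunk is just for illustration, and the same assumptions from the bullets apply (UIDC is all digits, dates are day-first, pyarrow is installed):

import pandas as pd


def optimize_chunk(chunk):
    """Apply the dtype optimizations above to one batch of parsed items."""
    df = pd.json_normalize(list(chunk))      # splits UID into UID.Number / UID.Date
    df['UIDC'] = df['UIDC'].astype('int32')  # assumes digits only; drops leading zeros
    df['UID.Date'] = pd.to_datetime(df['UID.Date'], format='%d.%m.%Y')
    df['TYPE'] = df['TYPE'].astype('category')  # concat keeps this dtype only if the
                                                # categories match across chunks
    df['UID.Number'] = df['UID.Number'].astype('string[pyarrow]')  # requires pyarrow
    return df


chunksize = 100_000
df_final = pd.concat(
    (optimize_chunk(chunk) for chunk in read_json_chunked('test364.json', chunksize)),
    ignore_index=True,
)
print(df_final.memory_usage(deep=True).sum() / len(df_final))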

Nick ODell