Iterative JSON parsing
The tool I recommend for parsing huge JSON files is ijson.
Here's an example of how to use ijson to read a JSON list.
import ijson

# replace this with the filename of your file
with open('test364.json') as f:
    # 'item' yields each element of the top-level JSON array, one at a time
    items = ijson.items(f, 'item')
    for item in items:
        print(item)
The advantage of ijson is that it doesn't read the entire file into memory at once; it holds only one element in memory at a time.
What you can do is read in a batch of items, convert it to a DataFrame to save memory, and keep a list of DataFrames that you combine in the final step.
Here's what that might look like:
import itertools
import pandas as pd
import ijson


def batched(iterable, n):
    # https://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
    "Batch data into tuples of length n. The last batch may be shorter."
    # batched('ABCDEFG', 3) --> ABC DEF G
    if n < 1:
        raise ValueError('n must be at least one')
    it = iter(iterable)
    while (batch := tuple(itertools.islice(it, n))):
        yield batch


def read_json_chunked(filename, chunksize):
    """Reads in JSON from filename and yields batches of items."""
    with open(filename) as f:
        items = ijson.items(f, 'item')
        yield from batched(items, chunksize)


df_list = []
chunksize = 3  # Change this to a higher value for speed
for chunk in read_json_chunked('test364.json', chunksize):
    df_list.append(pd.DataFrame(chunk))
df_final = pd.concat(df_list, ignore_index=True)
print(df_final)
Note: above I use a chunksize of 3 for testing. You should use a much larger chunksize, around 10,000 to 100,000.
Memory-Efficient Dtypes
The above DataFrame is fairly inefficient in how it uses memory. You can find how much memory it uses for each row with this expression:
>>> df_final.memory_usage(deep=True).sum() / len(df_final)
346.64
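For a per-column breakdown of where that memory goes, you can drop the .sum() from the same expression:

# deep=True counts the Python objects stored inside object columns
# (the strings and dicts), not just the 8-byte pointers to them.
print(df_final.memory_usage(deep=True))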
There are several things you can do to reduce memory usage. Each of the code examples below operates on the individual chunks of the DataFrame, because if you wait until after you call pd.concat(), you've already run out of memory.
Split UID column. The UID column takes the largest amount of memory, because for each row, it must store four strings and one dictionary object. You're only really interested in two of those strings. I'd suggest using pd.json_normalize():
for chunk in read_json_chunked('test364.json', chunksize):
    df_chunk = pd.json_normalize(chunk)
    df_list.append(df_chunk)
This produces a slightly different dataframe, with UID split into two columns.
     TYPE  UIDC UID.Number    UID.Date
0    DCF3  0001    ABRRRT1  21.01.2022
1    DCF3  0001   ABRRRT12  22.01.2022
2    DCF3  0001    ABRRRT1  21.01.2022
3    DCF3  0001   ABRRRT12  22.01.2022
4    DCF3  0001    ABRRRT1  21.01.2022
..    ...   ...        ...         ...
195  DCF3  0001   ABRRRT12  22.01.2022
196  DCF3  0001    ABRRRT1  21.01.2022
197  DCF3  0001   ABRRRT12  22.01.2022
198  DCF3  0001    ABRRRT1  21.01.2022
199  DCF3  0001   ABRRRT12  22.01.2022

[200 rows x 4 columns]
Warning: this assumes that the keys within each dictionary are mostly the same. If every value of UID contains different keys, this could end up using more memory.
(See table below for how much memory this saves.)
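To illustrate the warning about mismatched keys with two hypothetical records: pd.json_normalize() creates one column per key it sees and fills the gaps with NaN, so the result can end up wider (and larger) than the original.

records = [
    {'TYPE': 'DCF3', 'UID': {'Number': 'ABRRRT1', 'Date': '21.01.2022'}},
    {'TYPE': 'DCF3', 'UID': {'Code': 'X1', 'Extra': 'Y2'}},  # different keys
]
print(pd.json_normalize(records))
# One row has NaN in UID.Number/UID.Date, the other in UID.Code/UID.Extra.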
Convert UIDC to integer. Strings in Python have significant overhead.
df_chunk['UIDC'] = df_chunk['UIDC'].astype('int32')
Warning: this makes the assumption that all UIDC values are digits, and that the leading zeros of UIDC don't matter.
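To see that overhead directly (the exact numbers vary slightly between Python versions):

import sys
import numpy as np

print(sys.getsizeof('0001'))       # ~53 bytes for a short Python str object
print(np.dtype('int32').itemsize)  # 4 bytes per value in an int32 column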
Convert UID.Date to datetime. Same argument as before - datetime is much smaller because it is only 8 bytes per date.
df_chunk['UID.Date'] = pd.to_datetime(df_chunk['UID.Date'], format='%d.%m.%Y')
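You can confirm the size per value after the conversion:

print(df_chunk['UID.Date'].dtype)           # datetime64[ns]
print(df_chunk['UID.Date'].dtype.itemsize)  # 8 bytes per date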
Convert TYPE to categorical. Assuming that the column TYPE mostly contains duplicate values (more than 50%), you can save memory by using the categorical dtype. This stores a list of the unique values in the column, then represents each value as an integer index into that list.
df_chunk['TYPE'] = df_chunk['TYPE'].astype('category')
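You can inspect how this is stored on one chunk: the values live once in .cat.categories, and each row is just a small integer code.

print(df_chunk['TYPE'].cat.categories)   # the unique values, stored once
print(df_chunk['TYPE'].cat.codes.dtype)  # int8 codes when there are few categories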
Convert UID.Number to PyArrow-backed string. Since this column contains non-numeric values, the integer trick won't work, and since (I assume) the value is different for each row, the categorical trick won't work either.
However, you can still reduce the memory usage by using the string dtype, with the storage type set to pyarrow. This reduces the string overhead from ~40 bytes to ~4 bytes.
df_chunk['UID.Number'] = df_chunk['UID.Number'].astype('string[pyarrow]')
You can read more about the string dtype in the pandas documentation.
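To compare the two representations on one chunk (this requires the pyarrow package to be installed):

before = df_chunk['UID.Number'].astype(object).memory_usage(deep=True, index=False)
after = df_chunk['UID.Number'].astype('string[pyarrow]').memory_usage(deep=True, index=False)
print(before, after)  # bytes used by object strings vs. pyarrow-backed strings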
Combined, these tricks can reduce the memory usage of the DataFrame by 10x.
| Method                        | Bytes Per Row |
|-------------------------------|---------------|
| Original                      | 346.64        |
| Split UID column              | 254.14        |
| UIDC to integer               | 197.14        |
| UID.Date to datetime64        | 138.14        |
| TYPE converted to categorical | 78.99         |
| UID.Number to pyarrow string  | 25.99         |
(Each row includes all optimizations from the previous rows. All measurements were done on a 200-row DataFrame.)
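Putting all of the above together, the per-chunk conversion might look like this. This is a sketch that reuses read_json_chunked and chunksize from earlier and assumes the per-step assumptions above (digit-only UIDC, parseable dates, mostly-duplicate TYPE) hold for your data.

def optimize_chunk(chunk):
    """Build a memory-efficient DataFrame from one batch of JSON items."""
    df_chunk = pd.json_normalize(chunk)  # splits UID into UID.Number / UID.Date
    df_chunk['UIDC'] = df_chunk['UIDC'].astype('int32')
    df_chunk['UID.Date'] = pd.to_datetime(df_chunk['UID.Date'], format='%d.%m.%Y')
    df_chunk['TYPE'] = df_chunk['TYPE'].astype('category')
    df_chunk['UID.Number'] = df_chunk['UID.Number'].astype('string[pyarrow]')
    return df_chunk

df_list = [optimize_chunk(chunk)
           for chunk in read_json_chunked('test364.json', chunksize)]
# Note: pd.concat only keeps the categorical dtype if all chunks ended up with
# the same categories; otherwise the TYPE column falls back to object.
df_final = pd.concat(df_list, ignore_index=True)
print(df_final.memory_usage(deep=True).sum() / len(df_final))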