I have a JSON file that is 19.4 GB in size. I have tried several ways to read it; for example, pandas.read_json(filename) simply crashes the notebook. I am looking for a way to load the file lazily, e.g. 1 GB at a time, and then dump it into a SQLite or Neo4j database to analyze the data. Any ideas would be really appreciated.
Monowar Anjum
If you have it on Linux, you could use @Ferris' suggestion below, or use `jq`, since you are moving the data into a db. – sammywemmy Jan 05 '21 at 07:00
2 Answers
Maybe you could try PySpark, as it is both distributed and lazy. The PySpark APIs can be used to analyze the data in memory and, if required, the DataFrame can be dumped to a database.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
# spill Spark's scratch data to a disk with enough free space
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
# give the executor and driver enough heap for a ~20 GB input
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = SQLContext(sc)

# read.json is lazy and expects line-delimited JSON by default;
# use .option('multiLine', 'true') for a single top-level JSON document
data = sqlContext.read.json('file.json')
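If the DataFrame then needs to land in a database, it can be written out over JDBC. A minimal sketch, assuming a JDBC-reachable database whose driver jar is available to Spark; the URL, table name, and credentials below are placeholders:
# write the DataFrame to a relational database over JDBC
# (URL, table, user and password are placeholders; the matching
#  JDBC driver jar must be on Spark's classpath, e.g. via --jars)
data.write \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/analysis') \
    .option('dbtable', 'records') \
    .option('user', 'username') \
    .option('password', 'password') \
    .mode('append') \
    .save()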

hagarwal
Split the file by rows, if the JSON data is line-delimited (one record per line):
# In bash, split by 1,000,000 lines per file
split -d -l 1000000 file_name.json file_part_
Then handle each part file.
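A minimal sketch of that per-file loop, assuming the parts are line-delimited JSON with flat records and that the goal (as in the question) is a local SQLite database; the database path and table name are placeholders:
import glob
import sqlite3
import pandas as pd

# placeholder SQLite database; every part is appended to one table
conn = sqlite3.connect('analysis.db')

for part in sorted(glob.glob('file_part_*')):
    # each part produced by `split` is still line-delimited JSON
    df = pd.read_json(part, lines=True)
    df.to_sql('records', conn, if_exists='append', index=False)

conn.close()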
Or use `chunksize`; pandas.read_json can parse line-delimited JSON in chunks:
import pandas as pd

# read 100,000 records at a time; each iteration yields a regular DataFrame
df_iter = pd.read_json(filename, lines=True, chunksize=100000)
for df in df_iter:
    # handle the chunk here
    break
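If the target is Neo4j rather than SQLite, each chunk can instead be pushed through the official neo4j Python driver. A minimal sketch, assuming a local Neo4j instance and flat records with primitive values only; the URI, credentials, and the Record label are placeholders:
import json

import pandas as pd
from neo4j import GraphDatabase  # official Neo4j Python driver

# placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

df_iter = pd.read_json(filename, lines=True, chunksize=100000)
with driver.session() as session:
    for df in df_iter:
        # round-trip through JSON to turn numpy scalars into plain Python
        # types that the driver can serialize
        rows = json.loads(df.to_json(orient='records'))
        # UNWIND creates one node per record; assumes each record is a
        # flat map of primitive values (nested objects need flattening)
        session.run('UNWIND $rows AS row CREATE (n:Record) SET n = row', rows=rows)
driver.close()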

Ferris