I have a JSON file that is 19.4 GB in size. I have tried several ways to read it; for example, pandas.read_json(filename) simply crashes the notebook. I am looking for a way to load the file lazily, e.g. 1 GB at a time, and then dump it into a SQLite or Neo4j database to analyze the data. Any ideas would be really appreciated.
Monowar Anjum
If you have it on Linux, you could use @Ferris' suggestion below, or use `jq`, since you are moving the data into a db. – sammywemmy Jan 05 '21 at 07:00
2 Answers
Maybe you could try PySpark, as it is both distributed and lazy. The PySpark APIs can be used to analyze the data in memory and, if required, the DataFrame can be dumped to a database.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
# spill Spark's scratch data to a disk with enough free space
conf.set('spark.local.dir', '/remote/data/match/spark')
conf.set('spark.sql.shuffle.partitions', '2100')
# give the executor and driver enough heap for a ~20 GB input
SparkContext.setSystemProperty('spark.executor.memory', '10g')
SparkContext.setSystemProperty('spark.driver.memory', '10g')
sc = SparkContext(appName='mm_exp', conf=conf)
sqlContext = SQLContext(sc)

# read.json is lazy and expects line-delimited JSON by default;
# use .option('multiLine', 'true') for a single top-level JSON document
data = sqlContext.read.json('file.json')
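If the DataFrame then needs to land in a database, it can be written out over JDBC. A minimal sketch, assuming a JDBC-reachable database whose driver jar is available to Spark; the URL, table name, and credentials below are placeholders:
# write the DataFrame to a relational database over JDBC
# (URL, table, user and password are placeholders; the matching
#  JDBC driver jar must be on Spark's classpath, e.g. via --jars)
data.write \
    .format('jdbc') \
    .option('url', 'jdbc:postgresql://localhost:5432/analysis') \
    .option('dbtable', 'records') \
    .option('user', 'username') \
    .option('password', 'password') \
    .mode('append') \
    .save()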

hagarwal
Split the file by rows, if the JSON data is line-delimited (one record per line):
# In bash, split by 1,000,000 lines per file
split -d -l 1000000 file_name.json file_part_
Then handle each part file.
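A minimal sketch of that per-file loop, assuming the parts are line-delimited JSON with flat records and that the goal (as in the question) is a local SQLite database; the database path and table name are placeholders:
import glob
import sqlite3
import pandas as pd

# placeholder SQLite database; every part is appended to one table
conn = sqlite3.connect('analysis.db')

for part in sorted(glob.glob('file_part_*')):
    # each part produced by `split` is still line-delimited JSON
    df = pd.read_json(part, lines=True)
    df.to_sql('records', conn, if_exists='append', index=False)

conn.close()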
Or use `chunksize`; pandas.read_json can parse line-delimited JSON in chunks:
import pandas as pd

# read 100,000 records at a time; each iteration yields a regular DataFrame
df_iter = pd.read_json(filename, lines=True, chunksize=100000)
for df in df_iter:
    # handle the chunk here
    break
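If the target is Neo4j rather than SQLite, each chunk can instead be pushed through the official neo4j Python driver. A minimal sketch, assuming a local Neo4j instance and flat records with primitive values only; the URI, credentials, and the Record label are placeholders:
import json

import pandas as pd
from neo4j import GraphDatabase  # official Neo4j Python driver

# placeholder connection details for a local Neo4j instance
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

df_iter = pd.read_json(filename, lines=True, chunksize=100000)
with driver.session() as session:
    for df in df_iter:
        # round-trip through JSON to turn numpy scalars into plain Python
        # types that the driver can serialize
        rows = json.loads(df.to_json(orient='records'))
        # UNWIND creates one node per record; assumes each record is a
        # flat map of primitive values (nested objects need flattening)
        session.run('UNWIND $rows AS row CREATE (n:Record) SET n = row', rows=rows)
driver.close()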

Ferris