I am reading a file of JSON data (3.1M+ records) and want to compare the memory and time efficiency of reading the whole file at once versus reading it line by line.
File1 is serialized JSON data: a single list of 3.1M+ dictionaries, 811M in size.
File2 is serialized JSON data with one dictionary per line: 3.1M+ lines in total, 480M in size.
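To make the two layouts concrete, here is a minimal sketch of how files shaped like File1 and File2 could be written and read back. The sample records and the `File1_sample.json` / `File2_sample.json` names are made up for illustration; they are not the real data.

```python
import json
import os
import tempfile

# Hypothetical sample records; the real files hold 3.1M+ dictionaries.
records = [{"id": i, "name": "item%d" % i} for i in range(3)]

tmp_dir = tempfile.mkdtemp()
file1_path = os.path.join(tmp_dir, "File1_sample.json")
file2_path = os.path.join(tmp_dir, "File2_sample.json")

# File1 layout: the whole list serialized as a single JSON document.
with open(file1_path, "w") as f:
    json.dump(records, f)

# File2 layout: one JSON dictionary per line.
with open(file2_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# File1 has to be parsed as one document.
with open(file1_path) as f:
    whole = json.load(f)

# File2 can be parsed one line at a time.
with open(file2_path) as f:
    by_line = [json.loads(line) for line in f]
```

Both reads end up with the same list of dictionaries; only the on-disk layout and the parsing granularity differ.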
Profile info while reading file1
(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json
3108779
Filename: read_wholefile.json
Line #    Mem usage    Increment   Line Contents
================================================
     5      9.4 MiB      0.0 MiB   @profile
     6                             def read_file():
     7      9.4 MiB      0.0 MiB       f = open("File1.json")
     8   3725.3 MiB   3715.9 MiB       f_json = json.loads(f.read())
     9   3725.3 MiB      0.0 MiB       print len(f_json)
23805 function calls (22916 primitive calls) in 30.230 seconds
Profile info while reading file2
(flask)chitturiLaptop:data kiran$ python -m cProfile read_line_by_line.json
3108779
Filename: read_line_by_line.json
Line #    Mem usage    Increment   Line Contents
================================================
     4      9.4 MiB      0.0 MiB   @profile
     5                             def read_file():
     6      9.4 MiB      0.0 MiB       data_json = []
     7      9.4 MiB      0.0 MiB       with open("File2.json") as f:
     8   3726.2 MiB   3716.8 MiB           for line in f:
     9   3726.2 MiB      0.0 MiB               data_json.append(json.loads(line))
    10   3726.2 MiB      0.0 MiB       print len(data_json)
28002875 function calls (28001986 primitive calls) in 244.282 seconds
According to this SO post, shouldn't iterating through File2 take less memory? Instead, both approaches peak at about 3.7 GiB, and reading the whole file and loading it through JSON was also much faster (about 30 seconds versus 244 seconds).
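For comparison, here is a minimal sketch of what I understand streaming to mean: each line is parsed, used, and then dropped, rather than appended to a list as `data_json` is in my `read_file` above. The helper names `iter_records` and `count_records` and the small demo file are hypothetical, and this assumes each record can be handled on its own.

```python
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed dictionary per non-empty line of the file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    """Count records without holding more than one of them at a time."""
    return sum(1 for _ in iter_records(path))


# Tiny demo file in the File2 layout (one dictionary per line).
demo_path = os.path.join(tempfile.mkdtemp(), "File2_sample.json")
with open(demo_path, "w") as f:
    for i in range(5):
        f.write(json.dumps({"id": i}) + "\n")

record_count = count_records(demo_path)
```

In this form only the current line and its parsed dictionary need to be alive at once, whereas appending every record to a list keeps all 3.1M+ dictionaries in memory regardless of how the file was read.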
I am running Python 2.7.2 on Mac OS X 10.8.5.
EDIT
Profile info with json.load
(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json
3108779
Filename: read_wholefile.json
Line #    Mem usage    Increment   Line Contents
================================================
     5      9.4 MiB      0.0 MiB   @profile
     6                             def read_file():
     7      9.4 MiB      0.0 MiB       f = open("File1.json")
     8   3725.3 MiB   3715.9 MiB       f_json = json.load(f)
     9   3725.3 MiB      0.0 MiB       print len(f_json)
    10   3725.3 MiB      0.0 MiB       f.close()
23820 function calls (22931 primitive calls) in 27.266 seconds
EDIT2
Some statistics to support the answer.
(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json
3108779
Filename: read_wholefile.json
Line #    Mem usage    Increment   Line Contents
================================================
     5      9.4 MiB      0.0 MiB   @profile
     6                             def read_file():
     7      9.4 MiB      0.0 MiB       f = open("File1.json")
     8    819.9 MiB    810.6 MiB       serialized = f.read()
     9   4535.8 MiB   3715.9 MiB       deserialized = json.loads(serialized)
    10   4535.8 MiB      0.0 MiB       print len(deserialized)
    11   4535.8 MiB      0.0 MiB       f.close()
23856 function calls (22967 primitive calls) in 26.815 seconds