
I am trying to read a file with JSON data (3.1M+ records). I want to compare the memory and time efficiency of reading the whole file at once versus reading the file line by line.

File1 is serialized JSON data: a single list containing the 3.1M+ dictionaries, 811 MB in size.

File2 is serialized JSON data with one dictionary per line. In total there are 3.1M+ lines, 480 MB in size.
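For illustration, here is a minimal sketch (not my actual code) of how two such files could be produced; the indent of 4 for File1 matches what is discussed in the comments below, and the record contents here are made up:

import json

records = [{"id": i, "value": "x" * 100} for i in range(1000)]  # stand-in for the 3.1M+ dictionaries

# File1: one big JSON list, pretty-printed with an indent of 4
with open("File1.json", "w") as f:
    json.dump(records, f, indent=4)

# File2: one compact JSON object per line
with open("File2.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")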

Profile info while reading File1

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.loads(f.read())
 9   3725.3 MiB      0.0 MiB     print len(f_json)


     23805 function calls (22916 primitive calls) in 30.230 seconds

Profile info while reading File2

(flask)chitturiLaptop:data kiran$ python -m cProfile read_line_by_line.json 
3108779
Filename: read_line_by_line.json

 Line #    Mem usage    Increment   Line Contents
 ================================================
 4      9.4 MiB      0.0 MiB   @profile
 5                             def read_file():
 6      9.4 MiB      0.0 MiB     data_json = []
 7      9.4 MiB      0.0 MiB     with open("File2.json") as f:
 8   3726.2 MiB   3716.8 MiB       for line in f:
 9   3726.2 MiB      0.0 MiB         data_json.append(json.loads(line))
10   3726.2 MiB      0.0 MiB     print len(data_json)


     28002875 function calls (28001986 primitive calls) in 244.282 seconds

According to this SO post, shouldn't iterating through File2 take less memory? Reading the whole file and loading it through json also took less time.

I am running Python 2.7.2 on Mac OS X 10.8.5.

EDIT

Profile info with json.load

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.load(f)
 9   3725.3 MiB      0.0 MiB     print len(f_json)
10   3725.3 MiB      0.0 MiB     f.close()


     23820 function calls (22931 primitive calls) in 27.266 seconds

EDIT2

Some statistics to support the answer.

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8    819.9 MiB    810.6 MiB     serialized = f.read()
 9   4535.8 MiB   3715.9 MiB     deserialized  = json.loads(serialized)
10   4535.8 MiB      0.0 MiB     print len(deserialized)
11   4535.8 MiB      0.0 MiB     f.close()


     23856 function calls (22967 primitive calls) in 26.815 seconds
kich
  • Why is file 2 so much smaller? Is file 1 full of unnecessary whitespace? – user2357112 Dec 28 '13 at 06:01
  • Yeah, File1 has an indentation of 4 :). File2 has no indentation. – kich Dec 28 '13 at 06:04
  • I suspect you're not seeing the memory taken by reading the whole file at once, as it's discarded before the source line finishes executing. Instead, you're mostly seeing the memory consumed by the giant deserialized data structure, which is the same in both cases. – user2357112 Dec 28 '13 at 06:05
  • Note that `json.load(f)` is a better way to deserialize JSON from a file or file-like object than either way you tried. – user2357112 Dec 28 '13 at 06:06
  • When you say better, do you mean better in terms of memory or something else? I did not notice any difference in memory (results posted above). In case I am not seeing the memory taken by reading the whole file, how can I capture that memory? – kich Dec 28 '13 at 06:11
  • The simplest way to see the memory taken by reading the whole file would be to assign it to a variable. Use `serialized = f.read(); deserialized = json.loads(serialized)`. As for "better", it should run faster, consume less memory, and be more concise in the source code. – user2357112 Dec 28 '13 at 06:22
  • You are right. Reading the whole file takes 810.6 MiB more than what the profiler reported before. I guess `json.load(f)` is the better way to do it. Are there any better ways to do JSON serialization and deserialization that reduce memory consumption? – kich Dec 28 '13 at 06:42
  • @user2357112 Would you like to submit your comment as an answer, or do you want me to post it with the results? I want to close this post since you already answered my question. – kich Dec 28 '13 at 10:06

1 Answer


Your first test doesn't show the memory consumed by reading the whole file into a giant string, since the giant string is discarded before the source line finishes and the profiler isn't showing you memory consumption in the middle of a line. If you save the string to a variable:

serialized = f.read()
deserialized = json.loads(serialized)

you'll see the 811 MB memory consumption for the temporary string. The ~3725 MB you're seeing in both tests is mostly the deserialized data structure, which is the same in both tests.

Finally, note that json.load(f) is a faster, more concise, and more memory-friendly way to load JSON data from a file than either json.loads(f.read()) or line-by-line iteration.
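A minimal sketch of that approach, reusing the file name from the question:

import json

# Parse straight from the file object instead of building the
# intermediate string yourself with f.read().
with open("File1.json") as f:
    data = json.load(f)

print(len(data))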

user2357112