
When I load the file with the json module, Python's memory usage spikes to about 1.8GB and I can't seem to get that memory released. I put together a very simple test case:

import json

with open("test_file.json", 'r') as f:
    j = json.load(f)

I'm sorry that I can't provide a sample JSON file; my test file has a lot of sensitive information, but for context, I'm dealing with a file on the order of 240MB. After running the snippet above, I have the previously mentioned 1.8GB of memory in use. If I then do `del j`, memory usage doesn't drop at all. If I follow that with a `gc.collect()`, it still doesn't drop. I even tried unloading the json module and running another `gc.collect()`.
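
The cleanup attempts amount to something like the following, run right after the snippet above (a sketch; none of these bring the reported memory back down):

import gc
import sys

del j                     # drop the only reference to the parsed data
gc.collect()              # explicit collection pass; usage stays around 1.8GB
del json                  # "unload" the json module by dropping...
del sys.modules['json']   # ...both references to it
gc.collect()              # still no drop in the process's memory usage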

I'm trying to run some memory profiling, but heapy has been churning at 100% CPU for about an hour now and has yet to produce any output.
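
For reference, the heapy attempt is essentially this (a sketch, assuming heapy from the guppy package; the final call is roughly where it churns):

import json
from guppy import hpy

hp = hpy()
hp.setrelheap()                  # only count allocations made after this point
with open("test_file.json", 'r') as f:
    j = json.load(f)
print(hp.heap())                 # summarize live objects by type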

Does anyone have any ideas? I've also tried the above using cjson rather than the bundled json module. cjson used about 30% less memory but otherwise displayed exactly the same issues.
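
The cjson variant was along these lines (a sketch; the python-cjson module decodes from a string rather than a file object):

import cjson

with open("test_file.json", 'r') as f:
    j = cjson.decode(f.read())   # ~30% less memory, but still never released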

I'm running Python 2.7.2 on Ubuntu server 11.10.

I'm happy to load up any memory profiler and see if it does better than heapy, and to provide any diagnostics you might think are necessary. I'm hunting around for a large test JSON file that I can provide for anyone else to give it a go.

Endophage
  • Please try it with another file. Are you running an interactive session or a Python script file? Do both show the same problem? – heltonbiker Jun 15 '12 at 20:33
  • Related: http://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python – ChristopheD Jun 15 '12 at 20:33
  • @ChristopheD My issue is with the memory never being released. I don't care so much that a lot of memory is used during parsing. – Endophage Jun 15 '12 at 20:34
  • @heltonbiker I've tried it with a few different files. The memory usage seems to be directly related to the size of the json but the issue with not freeing the memory is universal. – Endophage Jun 15 '12 at 20:35
  • @heltonbiker Sorry, didn't address other part of your question. This happens both in interpreter and when run as a script file (and under mod_wsgi and uwsgi for that matter). – Endophage Jun 15 '12 at 20:47
  • 1) Absolutely positive the parsed JSON objects are *no longer strongly reachable*? 2) Does the memory keep going up and up -- e.g. will it reach 3GB if doing the import twice? -- or only go up to "the largest dataset"? (I am not sure if Python can/does "release memory to the operating system", but many run-times can't, or won't, do this. That is, the GC can run and reclaim memory for the Python engine, but it won't necessarily give it back to the OS and the process will still show high values in `top`, etc.) –  Jun 15 '12 at 20:49
  • @pst I thought the same thing about releasing memory back to the OS, but simply creating a large array of strings which pushes Python's memory usage up to many GB, then calling `del` on the array, shows the memory being released back to the OS (roughly the experiment sketched after these comments). – Endophage Jun 15 '12 at 21:02
  • @Endophage: That _could_ just mean the array happens to be contiguous pages (or even contiguous pages at the top of memory), while the JSON objects are scattered around all over the place, often sharing pages with the still-live results, and therefore can't be returned to the OS. – abarnert Jun 15 '12 at 21:04
  • @abarnert that's possible... certainly wouldn't be intuitive given all I do is create then delete the object. That would suggest the json lib is indeed leaking memory and leaving fragments interleaved with the json objects. – Endophage Jun 15 '12 at 21:06
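
A rough sketch of the list-of-strings experiment mentioned in the comments above (the list size is arbitrary, and reading /proc/self/status is Linux-specific):

import gc

def rss_kb():
    # Current resident set size in kB, as reported by the Linux kernel.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])

print(rss_kb())                                  # baseline
big = [str(i) * 200 for i in xrange(3 * 10**6)]  # several GB of distinct strings
print(rss_kb())                                  # usage is now several GB higher
del big
gc.collect()
print(rss_kb())                                  # with one big list, this does drop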

1 Answer


I think these two links address some interesting points about this not necessarily being a json issue, but rather just a "large object" issue, and about how memory works in Python vs. the operating system.

See Why doesn't Python release the memory when I delete a large object? for why memory released by Python is not necessarily reflected by the operating system:

If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.

And on running the large-object work in a subprocess so that the OS handles the cleanup:

The only really reliable way to ensure that a large but temporary use of memory DOES return all resources to the system when it's done, is to have that use happen in a subprocess, which does the memory-hungry work then terminates. Under such conditions, the operating system WILL do its job, and gladly recycle all the resources the subprocess may have gobbled up. Fortunately, the multiprocessing module makes this kind of operation (which used to be rather a pain) not too bad in modern versions of Python.
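
A minimal sketch of that approach (the function and file names here are placeholders): the child process does the memory-hungry json.load, sends back only the small result that is actually needed, and returns all of its memory to the OS when it exits.

import json
import multiprocessing

def summarize(path, queue):
    # All of the parsing, and the ~1.8GB peak, happens in this child process.
    with open(path, 'r') as f:
        data = json.load(f)
    # Put only the (small) result on the queue, never the parsed tree itself.
    queue.put(len(data))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=summarize,
                                     args=("test_file.json", queue))
    worker.start()
    result = queue.get()   # fetch the result before joining
    worker.join()          # once the child exits, the OS reclaims everything it used
    print(result)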

jdi
  • Please make sure to include relevant excerpts/examples to counteract the non-stable nature of datum on the internet :) –  Jun 15 '12 at 20:57
  • Very frustrating. It does appear your first snippet describes the situation accurately. – Endophage Jun 15 '12 at 21:07
  • @Endophage: Yea, I actually remember seeing a similar question like this a few months ago, about large json files. The OP was going through like 4 different json libs trying to find the most memory efficient – jdi Jun 15 '12 at 21:41
  • I think I'm going to go the multiprocessing route. Then I can kill the process once all the processing is done and my main process maintains a small memory footprint. – Endophage Jun 15 '12 at 21:45
  • @jdi awesome! you're the man! for anyone interested I was sent this great video on how the os allocates memory from PyCon 2012 http://pyvideo.org/video/717/python-linkers-and-virtual-memory – Endophage Jun 16 '12 at 05:52