Python (and Spyder) raise a MemoryError when I load a JSON file that is 500 MB in size.

But my computer has 32 GB of RAM, and the "memory" indicator displayed by Spyder only goes from 15% to 19% when I try to load it! It seems that I should have much more space...

Is there something I didn't think of?

Agape Gal'lo

2 Answers

500MB of JSON data does not result in 500MB of memory usage. It results in a multiple of that. Exactly what factor depends on the data, but a factor of 10 - 25 is not uncommon.

For example, the following simple JSON string of 14 characters (bytes on disk) results in a Python object that is almost 25 times larger (Python 3.6b3):

>>> import json
>>> from sys import getsizeof
>>> j = '{"foo": "bar"}'
>>> len(j)
14
>>> p = json.loads(j)
>>> getsizeof(p) + sum(getsizeof(k) + getsizeof(v) for k, v in p.items())
344
>>> 344 / 14
24.571428571428573

That's because Python objects require some overhead; instances track the number of references to them, what type they are, and their attributes (if the type supports attributes) or their contents (in the case of containers).
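
To see that overhead directly, here is a minimal sketch that prints the base size of a few tiny objects; the exact numbers vary by Python version and platform, but none of them are anywhere near zero:

import sys

# Even "empty" objects carry dozens of bytes of bookkeeping overhead
# (reference count, type pointer, container bookkeeping).
for obj in (1, '', 'bar', [], {}):
    print(type(obj).__name__, 'occupies', sys.getsizeof(obj), 'bytes')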

If you are using the built-in json library to load that file, it has to build larger and larger objects from the contents as they are parsed, and at some point your OS will refuse to provide more memory. That won't be at 32GB, because there is a per-process limit on how much memory can be used, so it is more likely to happen at 4GB. At that point, all the objects created so far are freed again, so in the end the actual memory use may not have changed much.

The solution is to either break up that large JSON file into smaller subsets, or to use an event-driven JSON parser like ijson.

An event-driven JSON parser doesn't create Python objects for the whole file, only for the currently parsed item, and it notifies your code with an event for each item it parses (like 'starting an array', 'here is a string', 'now starting a mapping', 'this is the end of the mapping', etc.). You can then decide what data you need to keep and what to ignore. Anything you ignore is discarded again, and memory use stays low.
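
As a rough sketch of that approach (the file name and record layout here are assumptions; adjust the prefix passed to ijson.items() to match your actual structure, e.g. 'item' for the elements of a top-level JSON array):

import ijson

# Stream the file instead of loading it whole; 'big_file.json' and the
# 'interesting' key below are placeholders for your own data.
with open('big_file.json', 'rb') as f:
    # ijson.items() yields one fully built Python object per matching
    # element, so only one element is in memory at a time.
    for record in ijson.items(f, 'item'):
        if record.get('interesting'):  # hypothetical filter: keep only what you need
            print(record)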

Martijn Pieters
  • Perfect answer, as always from you ;) For the others, I add a link to an explanation of how ijson works: http://stackoverflow.com/questions/40330820/load-an-element-with-python-from-large-json-file – Agape Gal'lo Nov 03 '16 at 11:23
  • And is there a way to change the memory limit per process? – Agape Gal'lo Nov 03 '16 at 19:53
  • @AgapeGal'lo: I'm not sure you can; it looks to me that on Windows, you can only *decrease* the limits: [Set Windows process (or user) memory limit](//stackoverflow.com/q/192876). I strongly doubt that that is going to work out for you. – Martijn Pieters Nov 03 '16 at 19:55
  • @AgapeGal'lo: also see https://msdn.microsoft.com/en-gb/library/windows/desktop/aa366778(v=vs.85).aspx – Martijn Pieters Nov 03 '16 at 19:57
  • But, as I have Windows 10, 64-bit, I should be limited to 2TB. I tested the same program under Ubuntu, and it works perfectly!! I tried to uninstall and re-install pythonxy, but nothing helps; it does not want to work... – Agape Gal'lo Nov 04 '16 at 18:37
  • @AgapeGal'lo: even on 64-bit Windows systems, a 32-bit process is limited to either 2GB or 4GB of memory, based on the `IMAGE_FILE_LARGE_ADDRESS_AWARE` bit on the process. If you have a 64-bit Python binary (look for x86-64), then you can go up to 8TB, but only if the `IMAGE_FILE_LARGE_ADDRESS_AWARE` bit is set for the process. I don't know if that flag is set for the Python binaries you download from python.org, however; I can find very little info on that, but with the flag *not* set you can only address 2GB whatever version of Python you have. – Martijn Pieters Nov 04 '16 at 18:54
  • "but only if the IMAGE_FILE_LARGE_ADDRESS_AWARE bit is set for the process" I do not understand. How can I set this? – Agape Gal'lo Nov 04 '16 at 18:56
  • @AgapeGal'lo: the MS documentation suggests it is set when compiling. [This answer](http://stackoverflow.com/a/25309556/100297) suggests you can set the flag later with the `editbin` tool. – Martijn Pieters Nov 04 '16 at 18:58
  • Yes, but as you said, going from 2 to 4 GB will not help a lot... But the [following answer](http://stackoverflow.com/a/20155448/6476722) is interesting. It seems the problem is fixed with the 64-bit version of Python. I tried to install it, but at no point did the installer ask me to choose between the 32- and 64-bit one.... – Agape Gal'lo Nov 04 '16 at 19:06
  • @AgapeGal'lo: 2GB -> 4GB can be enough. But if you have an x86-64 build of Python, you go from 2GB -> 8TB max. – Martijn Pieters Nov 04 '16 at 19:07
  • "x86 build of Python" and how do I get that? – Agape Gal'lo Nov 04 '16 at 19:14
  • @AgapeGal'lo: sorry, that should be x86-64. Every Python Windows installer comes in x86 and x86-64 flavours: https://www.python.org/downloads/windows/. – Martijn Pieters Nov 04 '16 at 19:43
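
As an aside to the comments above, a quick standard-library check of which flavour of Python is actually running:

import struct
import sys

# A 64-bit CPython uses 8-byte pointers; a 32-bit build uses 4-byte ones.
print(struct.calcsize('P') * 8, 'bit interpreter')

# Equivalent check: sys.maxsize exceeds 2**32 only on a 64-bit build.
print('64-bit' if sys.maxsize > 2**32 else '32-bit')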

So, I will explain how I finally solved this problem. The first answer will work, but you have to know that loading elements one by one with ijson will be very slow... and at the end, you still do not have the whole file loaded.

So, the important information is that Windows limits your memory per process to 2 or 4 GB, depending on which Windows you use (32- or 64-bit). If you use pythonxy, it will be 2 GB (it only exists in a 32-bit version). Either way, that is very, very low!

I solved this problem by installing a virtual Linux machine on my Windows system, and it works. Here are the main steps to do so:

  1. Install VirtualBox
  2. Install Ubuntu (for example)
  3. Install Python and the scientific libraries, like SciPy, on the virtual machine
  4. Create a shared folder between the 2 "computers" (you will find tutorials on Google)
  5. Execute your code on your Ubuntu "computer": it should work ;)

NB: Do not forget to allocate sufficient RAM and disk space to your virtual machine.

This works for me; I don't have this "memory error" problem anymore.

I posted this answer here from there.

Agape Gal'lo