
Recently, I had data in a users.json file that was taking a long time to load in VS Code because the file was too large (surprising to me, since it was only a 29 MB file). I wanted to use this chance to play around with Python's memory usage, so I loaded the whole file into memory and it worked as expected.

I do have a question though, more of me needing an explanation; forgive me if its answer is too obvious:

When I introspected the loaded JSON object, I found that the object size (1.3 MB) was far less than the file size (29.6 MB) on my file system (macOS). How could this be? The difference in size is just too big to ignore. To make things worse, I had a smaller file, and that one returned similar results for both the on-disk and the loaded size (~358 KB), haha.

import json

with open('users.json') as infile:
    data = json.load(infile)
    print(f'Object Item Count: {len(data):,} items\nObject Size: {data.__sizeof__():,} bytes')

Using sys.getsizeof(data) returns something similar, just with some garbage-collector overhead added.

The following returns the accurate size of the file on disk (29,586,765 bytes, ~29 MB):

from pathlib import Path

Path('users.json').stat().st_size

Can someone please explain to me what is happening? One would think there should be some similarity in size, or maybe I'm wrong.

nosahama

1 Answer


sys.getsizeof() doesn't recurse into objects:

Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.

All of the strings, numbers, etc. that get loaded from your JSON file are those aforementioned "objects being referred to".
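
To make that concrete, here is a minimal sketch (exact byte counts vary by Python version and platform) showing that a container's reported size only covers its own header and the references it holds, not the objects it points to:

import sys

big_string = 'x' * 1_000_000        # ~1 MB of character data
container = [big_string]            # a list holding a single reference

print(sys.getsizeof(big_string))    # on the order of 1,000,049 bytes
print(sys.getsizeof(container))     # only ~64 bytes: the list header plus one pointer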

For a more accurate result, you could recursively walk the loaded object and sum sys.getsizeof() over it and everything it refers to (the recipe linked in the comments below does exactly this).
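
Something along these lines, as a rough sketch (the helper name deep_getsizeof is made up here, and it only handles the container types the json module produces):

import sys

def deep_getsizeof(obj, seen=None):
    # Recursively sum sys.getsizeof() over obj and everything it refers to.
    if seen is None:
        seen = set()
    if id(obj) in seen:              # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

print(f'Deep size: {deep_getsizeof(data):,} bytes')   # data as loaded in the question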

That said, though, some objects will be smaller in memory than on disk; for instance, a large number, say 36 << 921, is 279 bytes on disk (as decimal text) and sys.getsizeof() pins it at 148 bytes in memory. Similarly, a smart enough JSON decoder could share objects for repeating dict keys, which the default JSON decoder actually does, see https://github.com/python/cpython/commit/7d6e076f6d8dd48cfd748b02dad17dbeb0b346a3.
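
You can check that particular example yourself (a quick sketch; the in-memory figure is what a typical 64-bit CPython reports):

import sys

n = 36 << 921
print(len(str(n)))       # 279 decimal digits, i.e. 279 bytes when written out as JSON text
print(sys.getsizeof(n))  # 148 bytes for the int object in memory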

AKX
  • Thanks, makes sense, but if I load a JSON file, it's loaded into one object and this object doesn't refer to any object; rather, it should then be referred to. The data within the object is within the object, so the talk of referencing still confuses me. Although Python officially gives [this](https://code.activestate.com/recipes/577504/) as an example of using `getsizeof()` recursively to find the size of containers and all their contents, it should be more intuitive than that imo. It's really misleading. – nosahama Apr 27 '20 at 10:22
  • @nosahama it's only "misleading" when you "misunderstand" Python's data model and [how python's variables really work](https://nedbatchelder.com/text/names.html). Here your assertions that "The data within the object is within the object" and "this object doesn't refer to any object" are just plain wrong (cf the link above) - Python's objects are stored in a "blackbox" heap space, and they never "contain" anything, they always only _refer_ to other objects. This is Python, not C - different language, different data model. – bruno desthuilliers Apr 27 '20 at 11:32
  • @AKX "repeating dict keys" don't exist - by definition a dict key is unique -, and sharing repeated objects would lead to unexpected behaviours if you mutate any of those shared objects. – bruno desthuilliers Apr 27 '20 at 11:35
  • @brunodesthuilliers I mean parsing `[{"foo": "bar"}, {"foo": "quux"}]` will likely create two string objects with the content `"foo"`. As you know, strings are immutable, so a smarter JSON decoder _could_ share them, as well as integers. – AKX Apr 27 '20 at 11:37
  • @AKX that's clearer indeed ;-) - just note that `"foo"` is a valid Python identifier and as such it's already interned by the CPython runtime - and so are "small" integers (for a definition of "small" that changed over time - don't know what the current spec is). But as long as we're talking immutable objects, a json parser could indeed implement more aggressive caching, but since json documents are usually rather small (a 29 MB json doc is crazy as far as I'm concerned), I'm not sure the overhead would bring much improvement for most cases. – bruno desthuilliers Apr 27 '20 at 11:46
  • 29 megs of JSON is not crazy at all in my books... GeoJSON files, for instance, can be rather large and have very repetitive keys. We actually bumped into strings _not_ being shared in the context of the Babel project: https://github.com/python-babel/babel/issues/571 – AKX Apr 27 '20 at 11:54
  • Hey, who knew, the JSON module actually _does_ memoize object keys! https://github.com/python/cpython/commit/7d6e076f6d8dd48cfd748b02dad17dbeb0b346a3 – AKX Apr 27 '20 at 12:13
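
A quick way to see that key memoization in action (identity checks like this are implementation-specific, but with CPython's default decoder this should print True):

import json

parsed = json.loads('[{"foo": "bar"}, {"foo": "quux"}]')
key_a = next(iter(parsed[0]))   # the "foo" key of the first object
key_b = next(iter(parsed[1]))   # the "foo" key of the second object
print(key_a is key_b)           # True: the decoder reuses the same string object for repeated keys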