I am writing a Python object to a file with json.dump, but I only want to write objects whose output would not exceed a 10KB file size.
How can I estimate the size of an object before writing it?
Here's my take on it.
We start with the following sample (taken from here, with a little extra added to make it interesting):
sample_orig = """{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
},
"a little extra" : "∫ßåøπœ®†"
}"""
Next, we define a test function to perform the encoding and output the size:
import json
from io import FileIO

def encode_sample(sample: str):
    for encoding in ('ascii', 'utf8', 'utf16'):
        filename = f'{encoding}.json'
        encoded_sample = sample.encode(encoding=encoding, errors='replace')
        with FileIO(filename, mode='wb') as f:
            f.write(encoded_sample)
            assert len(encoded_sample) == f.tell()
            print(f'{encoding}: {f.tell()} bytes')
The assert proves that the size reported by len matches the number of bytes actually written to the file, which holds because we measure the encoded bytes object (not the str). If the two ever differed, the assert would raise an AssertionError.
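For example, the non-ASCII characters added to the sample are one character each, but several bytes each once encoded, which is exactly the difference the encoding step captures (a quick illustration, reusing those characters):

extra = "∫ßåøπœ®†"
print(len(extra))                 # 8 characters
print(len(extra.encode('utf8')))  # 18 bytes; each character needs 2-3 bytes in UTF-8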
We'll encode the original sample first:
encode_sample(sample_orig)
Output:
ascii: 617 bytes
utf8: 627 bytes
utf16: 1236 bytes
Next, we run it through json.loads() and json.dumps() to "optimise" the size (i.e. remove unnecessary whitespace):
sample_reduced = json.dumps(json.loads(sample_orig))
encode_sample(sample_reduced)
Output:
ascii: 455 bytes
utf8: 455 bytes
utf16: 912 bytes
Remarks:
The OP asked "[…] writing a python object with json.dump", so the "optimisation" by removing whitespace doesn't really matter, but I left it in as it might benefit others.
Encoding matters. ascii and utf8 (the default) result in the same file size if the output contains only ASCII characters. Because I added a little extra non-ASCII at the end of the JSON, the file sizes for those two encodings differ. And utf16 will of course be the largest of the three.
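Note also that json.dumps() escapes non-ASCII characters by default (ensure_ascii=True), which is why the reduced sample comes out the same size in ascii and utf8 above. A quick check of the alternative, reusing encode_sample() from above:

sample_raw = json.dumps(json.loads(sample_orig), ensure_ascii=False)
encode_sample(sample_raw)

With ensure_ascii=False the non-ASCII characters are written as-is, so the utf8 file should again come out a few bytes larger than the ascii one (where errors='replace' turns each of them into a single '?').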
As stated before, you can use len to get the size of the object if you encode it first.
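Putting it together for the 10KB limit from the question, a minimal sketch (the threshold, filename and function name are just placeholders) could look like this:

import json

MAX_BYTES = 10 * 1024  # the 10KB limit from the question

def dump_if_small_enough(obj, filename, encoding='utf8'):
    encoded = json.dumps(obj).encode(encoding)
    if len(encoded) > MAX_BYTES:
        return False  # would exceed the limit, so don't write
    with open(filename, 'wb') as f:
        f.write(encoded)
    return True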
Convert the JSON to a string with json.dumps(), then use sys.getsizeof(). It returns the in-memory size of the string in bytes, so you can divide by 1024 if you want to compare it against a threshold value in kilobytes.
sys.getsizeof(json.dumps(object))
Sample usage:
import json
import sys
x = '{"name":"John", "age":30, "car":null}'
y = json.loads(x)
print(sys.getsizeof(json.dumps(y))) # 89
Edit: As mentioned in this thread, Python objects take up more memory than their content alone; an empty string already occupies 49 bytes. So subtract that overhead from the result to get a better estimate:
print(sys.getsizeof(json.dumps(y)) - sys.getsizeof(""))
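Applied to the 10KB limit from the question, a rough check along these lines (the filename is just a placeholder, and y is the object from the example above) would be:

import json
import sys

payload = json.dumps(y)
# in-memory size minus the fixed str overhead approximates the character count
if sys.getsizeof(payload) - sys.getsizeof("") <= 10 * 1024:
    with open('data.json', 'w') as f:
        f.write(payload)

Keep in mind this estimates the size of the Python string, not the encoded file; for an exact byte count you would still encode first and take len() of the result, as in the other approach.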