I am writing a Python object to a file with json.dump, but I only want to write objects whose output would not exceed a 10KB file size.
How can I estimate the size of an object before writing it?
Here's my take on it.
We start with the following sample (taken from here, with a little extra added to make it interesting):
sample_orig = """{
"glossary": {
"title": "example glossary",
"GlossDiv": {
"title": "S",
"GlossList": {
"GlossEntry": {
"ID": "SGML",
"SortAs": "SGML",
"GlossTerm": "Standard Generalized Markup Language",
"Acronym": "SGML",
"Abbrev": "ISO 8879:1986",
"GlossDef": {
"para": "A meta-markup language, used to create markup languages such as DocBook.",
"GlossSeeAlso": ["GML", "XML"]
},
"GlossSee": "markup"
}
}
}
},
"a little extra" : "∫ßåøπœ®†"
}"""
Next, we define a test function to perform the encoding and output the size:
import json
from io import FileIO

def encode_sample(sample: str):
    for encoding in ('ascii', 'utf8', 'utf16'):
        filename = f'{encoding}.json'
        encoded_sample = sample.encode(encoding=encoding, errors='replace')
        with FileIO(filename, mode='wb') as f:
            f.write(encoded_sample)
            assert len(encoded_sample) == f.tell()
            print(f'{encoding}: {f.tell()} bytes')
The assert proves that the size reported by len matches the number of bytes actually written to the file, which holds because we measure the encoded bytes object (not the str). If the two ever differed, the assert would raise an AssertionError.
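For example, the non-ASCII characters added to the sample are one character each, but several bytes each once encoded, which is exactly the difference the encoding step captures (a quick illustration, reusing those characters):

extra = "∫ßåøπœ®†"
print(len(extra))                 # 8 characters
print(len(extra.encode('utf8')))  # 18 bytes; each character needs 2-3 bytes in UTF-8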
We'll encode the original sample first:
encode_sample(sample_orig)
Output:
ascii: 617 bytes
utf8: 627 bytes
utf16: 1236 bytes
Next, we run it through json.loads() and json.dumps() to "optimise" the size (i.e. remove unnecessary whitespace):
sample_reduced = json.dumps(json.loads(sample_orig))
encode_sample(sample_reduced)
Output:
ascii: 455 bytes
utf8: 455 bytes
utf16: 912 bytes
Remarks:
The OP asked "[…] writing a python object with json.dump", so the "optimisation" by removing whitespace doesn't really matter, but I left it in as it might benefit others.
Encoding matters. ascii and utf8 (the default) result in the same file size if the output contains only ASCII characters. Because I added a little extra non-ASCII at the end of the JSON, the file sizes for those two encodings differ. And utf16 will of course be the largest of the three.
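Note also that json.dumps() escapes non-ASCII characters by default (ensure_ascii=True), which is why the reduced sample comes out the same size in ascii and utf8 above. A quick check of the alternative, reusing encode_sample() from above:

sample_raw = json.dumps(json.loads(sample_orig), ensure_ascii=False)
encode_sample(sample_raw)

With ensure_ascii=False the non-ASCII characters are written as-is, so the utf8 file should again come out a few bytes larger than the ascii one (where errors='replace' turns each of them into a single '?').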
As stated before, you can use len to get the size of the object if you encode it first.
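Putting it together for the 10KB limit from the question, a minimal sketch (the threshold, filename and function name are just placeholders) could look like this:

import json

MAX_BYTES = 10 * 1024  # the 10KB limit from the question

def dump_if_small_enough(obj, filename, encoding='utf8'):
    encoded = json.dumps(obj).encode(encoding)
    if len(encoded) > MAX_BYTES:
        return False  # would exceed the limit, so don't write
    with open(filename, 'wb') as f:
        f.write(encoded)
    return True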
Convert the JSON to a string with json.dumps(), then use sys.getsizeof(). It returns the in-memory size of the string in bytes, so you can divide by 1024 if you want to compare it against a threshold value in kilobytes.
sys.getsizeof(json.dumps(object))
Sample usage:
import json
import sys
x = '{"name":"John", "age":30, "car":null}'
y = json.loads(x)
print(sys.getsizeof(json.dumps(y))) # 89
Edit: As mentioned in this thread, Python objects take up more memory than their content alone; an empty string already occupies 49 bytes. So subtract that overhead from the result to get a better estimate:
print(sys.getsizeof(json.dumps(y)) - sys.getsizeof(""))
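Applied to the 10KB limit from the question, a rough check along these lines (the filename is just a placeholder, and y is the object from the example above) would be:

import json
import sys

payload = json.dumps(y)
# in-memory size minus the fixed str overhead approximates the character count
if sys.getsizeof(payload) - sys.getsizeof("") <= 10 * 1024:
    with open('data.json', 'w') as f:
        f.write(payload)

Keep in mind this estimates the size of the Python string, not the encoded file; for an exact byte count you would still encode first and take len() of the result, as in the other approach.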