I have a document that I am updating in MongoDB (pymongo), like so:

collec.replace_one({"_id": id}, json.loads(json.dumps(data, cls=CustomJSONEncoder)), upsert=True)

But it throws an error like so:

{DocumentTooLarge} 'update' command document too large

However, when I run:

sys.getsizeof(json.loads(json.dumps(data, cls=CustomJSONEncoder)))

It returns 232, which should definitely not exceed MongoDB's 16 MB per-document limit, right?

UPDATE: Added an image showing the evaluation of getsizeof in the debugger (it reports 232 bytes).

UPDATE 2: After some more debugging, it turned out the data really was exceeding the 16 MB limit; replace_one was just not throwing a detailed error. So I tested with insert_one instead:

collec.insert_one(json.loads(json.dumps(data, cls=CustomJSONEncoder)))

This then threw a more definitive error confirming that the document exceeded the 16 MB size limit.
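
(For anyone hitting the same thing, a minimal sketch of catching the error explicitly with pymongo's exception types; data, CustomJSONEncoder and collec are the names from the code above:)

import json
from pymongo.errors import DocumentTooLarge, OperationFailure

doc = json.loads(json.dumps(data, cls=CustomJSONEncoder))
try:
    collec.insert_one(doc)
except DocumentTooLarge as exc:
    # pymongo raises this client-side when the encoded BSON exceeds the limit
    print(exc)
except OperationFailure as exc:
    # server-side failures carry the raw error document in exc.details
    print(exc.details)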

But one thing I am still confused about is the sys.getsizeof method returning 232 bytes for it. That should not be the case, right?
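
(As the comments below explain, sys.getsizeof only counts the outer container, not the objects it references. A minimal sketch illustrating this:)

import sys

small = {"payload": "x"}
big = {"payload": "x" * 10_000_000}  # the string value alone is ~10 MB

print(sys.getsizeof(small))  # size of the outer dict only
print(sys.getsizeof(big))    # the same number: the 10 MB string is not counted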

Feel free to close this if it is not useful.

mp252
  • Does this answer your question? [mongodb and pymongo 16Mb limit on document size](https://stackoverflow.com/questions/25553186/mongodb-and-pymongo-16mb-limit-on-document-size) – Someone Special Jan 12 '23 at 09:58
  • @SomeoneSpecial I have read that already, and the docs on limits and threshold. But my object is only 232 bytes. Is there some other type of overhead that is happening behind the scenes that I am not aware of? – mp252 Jan 12 '23 at 10:01
  • 232 vs. 16'777'216 is a really big difference. There must be something wrong with your `getsizeof` call. – Wernfried Domscheit Jan 12 '23 at 10:01
  • @WernfriedDomscheit Exactly what I thought, but have a look at my updated post, with the image. – mp252 Jan 12 '23 at 10:04
  • This is why I think maybe some type of overhead is being added on somewhere else, but still how much overhead could it be? – mp252 Jan 12 '23 at 10:06
  • I'd enable the profiler and check what exactly was sent to the db (see the sketch below these comments): https://www.mongodb.com/docs/manual/reference/method/db.setProfilingLevel/ – Alex Blex Jan 12 '23 at 10:10
  • Can you print `data` to console? A size of 232 bytes should not be too much. – Wernfried Domscheit Jan 12 '23 at 10:30
  • @WernfriedDomscheit Yes, I am able to print it and select key values etc. I am just adding in a profiler now to see what is going on. – mp252 Jan 12 '23 at 10:32
  • Can you show a sample of your data? 232 bytes is a very small object. – Someone Special Jan 12 '23 at 11:51
  • Check UPDATE 2; it was a size issue. – mp252 Jan 13 '23 at 11:44
  • [`sys.getsizeof`](https://docs.python.org/3/library/sys.html#sys.getsizeof) cautions _"Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to."_ I think this is the answer to the 232 bytes mystery. – rickhg12hs Jan 13 '23 at 16:11
  • @rickhg12hs, could you elaborate, maybe in the form of an answer? I was under the impression the json.dumps / json.loads sequence makes a deep copy and dereferences all nested objects. Doesn't that make all 78Mb "directly attributed"? – Alex Blex Jan 15 '23 at 03:24
  • @AlexBlex I only think I know what's on the page I linked. My understanding is `dumps`/`loads` does make a deep copy through a JSON string, but `loads` will recreate a tree of objects and `sys.getsizeof` won't show the total size (maybe just the head/top of the tree?). – rickhg12hs Jan 15 '23 at 06:42
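
(Following up on the profiler suggestion in the comments, a rough pymongo sketch; the connection details and database name are hypothetical:)

import pymongo

client = pymongo.MongoClient()  # connection details assumed
db = client["mydb"]             # hypothetical database name

db.command("profile", 2)  # equivalent of db.setProfilingLevel(2): profile everything

# ... run the failing replace_one / insert_one here ...

for op in db["system.profile"].find().sort("ts", -1).limit(5):
    print(op.get("op"), op.get("ns"), op.get("responseLength"))

db.command("profile", 0)  # switch profiling back off

One caveat: if pymongo raises DocumentTooLarge client-side, the command never reaches the server, so nothing will show up in system.profile.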

1 Answer


It's super useful, at least the "thing I am confused about" part, as How do I determine the size of an object in Python? doesn't have an accepted answer.

As rickhg12hs pointed out, sys.getsizeof indeed returns the memory allocated for the top-level object only, and the linked answer has some snippets showing how to calculate the total size.
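
(For reference, a rough sketch in the spirit of those snippets; it only walks dicts and sequences, not arbitrary objects:)

import sys

def total_size(obj, seen=None):
    # Roughly sum sys.getsizeof over an object and everything it references.
    if seen is None:
        seen = set()
    if id(obj) in seen:  # guard against cycles and shared references
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size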

In this particular case, though, we can benefit from the bson package, since it's already installed as part of the MongoDB driver. The true size can be calculated with bson.BSON.encode, and it will be more accurate too, as the 16MB limit applies to the BSON-encoded data.

Consider the following piece of code:

import bson
import sys

obj = {"a": 1, "b": {"c": 2, "d":[{"e":3}, {"f": {"g":5}}]}}
one = {"obj": [obj]}
ten = {"obj": [obj] * 10} # almost 10 times as big

print(sys.getsizeof(one))  # 232
print(sys.getsizeof(ten))  # 232 too
print(len(bson.BSON.encode(one))) # 91
print(len(bson.BSON.encode(ten))) # 775
# which also let us estimate bson weight of `obj` as (775 - 91)/9 = 76 bytes
# and top-level overhead as 91 - 76 = 15 bytes

So to check the size of a MongoDB document you need to use len(bson.BSON.encode(json.loads(json.dumps(data, cls=CustomJSONEncoder)))) instead of sys.getsizeof(json.loads(json.dumps(data, cls=CustomJSONEncoder))),

or, if you don't use custom encoding: len(bson.BSON.encode(data))
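
(Putting it together with the code from the question, a minimal pre-flight check; data, CustomJSONEncoder, collec and id are the names from the question:)

import json
import bson

MAX_BSON_SIZE = 16 * 1024 * 1024  # MongoDB's default maxBsonObjectSize

doc = json.loads(json.dumps(data, cls=CustomJSONEncoder))
size = len(bson.BSON.encode(doc))
if size >= MAX_BSON_SIZE:
    raise ValueError(f"document is {size} bytes, over the 16 MB limit")
collec.replace_one({"_id": id}, doc, upsert=True)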

Alex Blex