
When I run a query over a large set of small objects (15k objects with only a few short string and boolean properties), without doing anything with these objects, I see my instance's memory usage increase continuously (a 70 MB increase). The memory growth doesn't look proportional to the amount of data the query ever needs to keep in memory.

The loop I use is the following:

cursor = None
while True:
    query = MyModel.all()
    if cursor:
        query.with_cursor(cursor)
    fetched = 0
    for result in query.run(batch_size=500):
        fetched += 1

        # Do something with 'result' here. Actually leaving it empty for
        # testing, to be sure I don't retain anything myself.

        if fetched == 500:
            cursor = query.cursor()
            break
    else:
        break
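For what it's worth, the for/else batching pattern above can be exercised against a plain Python list (standing in for the datastore query) to confirm the control flow terminates correctly; the `batches` helper and the integer "cursor" below are illustrative stand-ins only, not datastore API:

```python
def batches(items, size):
    """Mimic the cursor loop's control flow over an in-memory list."""
    pos = 0  # stand-in for the datastore cursor
    while True:
        fetched = 0
        batch = []
        for item in items[pos:]:
            batch.append(item)
            fetched += 1
            if fetched == size:
                pos += size  # "save the cursor" and start a new query
                break
        else:
            # Query exhausted before a full batch: emit the remainder and stop.
            if batch:
                yield batch
            return
        yield batch

all_batches = list(batches(list(range(1200)), 500))
# Three batches: 500, 500, 200
```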

To rule out Appstats, I call appstats.recording.dont_record() so that no stats are recorded.

Does anyone have any clue what might be going on? Or any pointers on how to debug/profile this?

Update 1: I turned on gc.set_debug(gc.DEBUG_STATS) in production, and I see the garbage collector being called regularly, so it is trying to collect garbage. When I call gc.collect() at the end of the loop (also the end of the request), it returns 0 and doesn't help.
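A gc.collect() return value of 0 only means no *unreachable* cycles were found; objects still referenced from somewhere (e.g. inside the datastore library) are not garbage at all, so the collector can't touch them. A small stdlib illustration of that distinction, with illustrative names:

```python
import gc

class Node(object):
    pass

a = Node()
a.self_ref = a   # create a reference cycle
gc.collect()     # clear any pre-existing garbage first

b = a            # a second reference keeps the cycle reachable
del a
reachable = gc.collect()   # cycle still reachable via 'b' -> returns 0

del b
freed = gc.collect()       # cycle now unreachable -> returns > 0
```

If the Property dicts are still referenced by some internal cache or batch buffer, gc.collect() will legitimately report 0 even though memory keeps growing.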

Update 2: I did some hacking to get guppy to work on dev_appserver, and this seemed to indicate that, after an explicit gc.collect() at the end of the loop, most of the memory was consumed by a 'dict of google.appengine.datastore.entity_pb.Property'.
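Without guppy, a rough lower bound on that kind of footprint can be taken with the stdlib; the `Prop` class and the sizes below are illustrative stand-ins, not measurements of entity_pb.Property itself:

```python
import sys

class Prop(object):
    """Illustrative stand-in for one decoded datastore property."""
    def __init__(self, name, value):
        self.name = name
        self.value = value

# One entity decoded into a dict of property objects, as guppy reported.
entity = {n: Prop(n, 'x' * 20) for n in ('name', 'active', 'label')}

per_entity = sys.getsizeof(entity) + sum(
    sys.getsizeof(p) + sys.getsizeof(p.name) + sys.getsizeof(p.value)
    for p in entity.values())
total = per_entity * 15000  # bytes if all 15k entities stay reachable
```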

Dan McGrath
Remko

2 Answers


Each model entity has some overhead.

Your query returns objects as protobufs, for starters.

So you will have a series of batched protobufs for the result set.

Then it is decoded. Each decoded entity includes the property names as well as the data for that entity. You have 15K entities. How big are your property names, for instance?
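The back-of-the-envelope numbers are consistent with what you observed: at roughly 1 KB of decoded names plus values per entity (an assumption, not a measurement), a single copy of the full result set is already sizeable, and a few retained copies approach the 70 MB increase:

```python
entities = 15000
bytes_per_entity = 1024  # assumed: property names + values once decoded

per_copy_mb = entities * bytes_per_entity / (1024.0 * 1024.0)
# ~14.6 MB per copy; protobufs + decoded entities (and anything else
# holding references) multiply this.
```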

So you have at least two copies of the result set in memory in various forms (possibly more), not including anything else you do with instances of the model class.

Your code/loop gives no opportunity for garbage collection, and that can/will happen later.

Have a look at tools like apptrace to help with memory profiling.

Tim Hoffman
  • I'm guessing I have about 500 characters for property names + values combined, so let's say 1 KB. Keeping everything in memory would indeed give me about 15 MB, so a couple of copies would indeed add up to 60 MB. But why does it need to keep it all in memory? Can you elaborate on how you can tell there's no opportunity for garbage collection in that loop? Can I force a garbage collection using `gc.collect()` in the loop, or after the loop? (I tried the former; that didn't have any effect, so I must not understand why the data is not collectable.) – Remko Aug 06 '15 at 12:59
  • Also, I tried apptrace, but this doesn't seem to work anymore in recent dev_appserver setups. – Remko Aug 06 '15 at 13:12
  • I doubt gc.collect will do anything of real value until the request completes. Try calling it as the very last thing before finishing the request. I would suggest two things: 1. Move from db to ndb; there are a number of efficiencies there. 2. Why loop over all 15K entities? Re-examine what you are trying to achieve. – Tim Hoffman Aug 07 '15 at 12:53

I have reported this to the App Engine team, and they seem to confirm this is actually a problem (suspected to be in the handling of cursors).

Remko
  • Did you file an issue with the GAE team? If so, can you please post the link? I asked an uncannily similar question on SO yesterday: http://stackoverflow.com/questions/32877705/how-is-memory-garbage-collected-in-app-engine-python-when-iterating-over-db-re/32883298#32883298 – tom Oct 01 '15 at 17:06