
I have the following code to loop over all entities of kind RawEmailModel and update counters:

import logging

from google.appengine.ext import ndb

# RawEmailModel is defined elsewhere in the application.

def update_emails(cursor=None, stats={}):
    BATCH_SIZE = 100
    if not cursor:
        # Start of the job
        pass
    next_cursor = cursor
    more = True
    try:
        while more:
            # Fetch one page of entities at a time, then drop the in-context cache.
            rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=next_cursor)
            ndb.get_context().clear_cache()
            for rawEmail in rawEmails:
                try:
                    stats[rawEmail.userId] += 1
                except KeyError:
                    stats[rawEmail.userId] = 0
            logging.debug(stats)
        logging.debug("Done counting")
    except Exception as e:
        logging.error(e)

I am clearing the ndb cache based on what I read in https://stackoverflow.com/a/12108891/2448805, but I still get errors saying I'm running out of memory:

20:21:55.240 {u'104211720960924551911': 45622, u'105605183894399744988': 0, u'114651439835375426353': 2, u'112308898027744263560': 667, u'112185522275060884315': 804}

F 20:22:01.389 Exceeded soft private memory limit of 128 MB with 153 MB after servicing 14 requests total

W 20:22:01.390 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.

I don't get why I'm still running out of memory when I keep clearing the cache at the top of the loop. Thanks!

Debnath Sinha

1 Answer


It looks like you have a large number of RawEmailModel entities, and your stats dict is growing until it hits the memory limit. Your ndb.get_context().clear_cache() is not going to help you here.

You may have to create another model to hold the counts, say RawEmailCounterModel with userId and total_count as fields, and keep updating it from the while loop instead of using your stats dict to do the counting.
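
As a rough illustration (the names RawEmailCounterModel, total_count, and update_counts below are made up for this sketch, not taken from your code), it could look something like this:

from google.appengine.ext import ndb

class RawEmailCounterModel(ndb.Model):
    # Keyed by userId, so each user has exactly one counter entity.
    userId = ndb.StringProperty()
    total_count = ndb.IntegerProperty(default=0)

def update_counts(batch_counts):
    # batch_counts is a plain dict of {userId: count} for one fetched page only,
    # so it never grows beyond BATCH_SIZE entries.
    for user_id, count in batch_counts.items():
        counter = RawEmailCounterModel.get_by_id(user_id)
        if counter is None:
            counter = RawEmailCounterModel(id=user_id, userId=user_id)
        counter.total_count += count
        counter.put()

Accumulating per-page counts into a small dict and flushing them to the datastore once per page keeps memory bounded at the batch size, at the cost of extra datastore writes.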

At least this will help with the out-of-memory issue, though it may not be performant.

gipsy
  • The stats dict has only 3 keys (for 3 users), and the loop gets through about 30-50k entries before running out of memory. And though we are fetching ~100k entities in total, the point of using fetch_page is to fetch 100 entities at a time and then clear the cache, so I thought there should only be 100 entities in memory at a time; that's why I was wondering why it still runs out of memory. – Debnath Sinha Dec 02 '14 at 04:44
  • You mean you have only 3 unique rawEmail.userId values in your entire collection of RawEmailModel entities? If not, you end up growing the stats dict to the number of unique rawEmail.userId values in your datastore. Maybe you are passing a stats dict into update_emails with only 3 users in it, but in your code, whenever you encounter a KeyError you add a new entry to the stats dict with 0 as the value. – gipsy Dec 02 '14 at 14:41
  • @DebnathSinha Another thing you might want to look into is instance size: if you are willing to pay more per instance, you can boost the actual RAM of your instances (see the app.yaml sketch after this thread). This will definitely help make sure you have enough memory for what you want to do. The problem is that, as gipsy is saying, your stats dict is growing. It doesn't look like it's your memcache, so clearing it won't fix your memory error. – Patrice Dec 27 '14 at 00:28
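
For reference, a minimal sketch of bumping the instance class in app.yaml on the App Engine standard environment (F2 here is just an example; the 128 MB soft limit in the log corresponds to the default F1 class, and larger classes have more RAM but cost more per instance hour):

# app.yaml (excerpt)
instance_class: F2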