2

In the GAE documentation, it states:

Because each get() or put() operation invokes a separate remote procedure call (RPC), issuing many such calls inside a loop is an inefficient way to process a collection of entities or keys at once.

Who knows how many other inefficiencies I have in my code, so I'd like to minimize as much as I can. Currently, I do have a for loop where each iteration has a separate query. Let's say I have a User, and a user has friends. I want to get the latest updates for every friend of the user. So what I have is an array of that user's friends:

for friend_dic in friends:
    email = friend_dic['email']
    lastUpdated = friend_dic['lastUpdated']
    userKey = Key('User', email)
    query = ndb.gql('SELECT * FROM StatusUpdates WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastUpdated)
    qit = query.iter()
    while (yield qit.has_next_async()):
        status = qit.next()
        status_list.append(status.to_dict())
raise ndb.Return(status_list)

Is there a more efficient way to do this, maybe somehow batch all these into one single query?

rbanffy
Snowman
  • Could you provide your data models? There may be an optimization possible depending on how you store friend relationships. Is `friends` just a ListProperty (or a repeated property if you're using NDB) or do you query a separate model for `friends` relationships? – someone1 Aug 27 '12 at 20:24
  • No friends is just a list property. Well there is a separate model for friends, but those don't save lastUpdated, so I just get the friends from the client device along with the date they were last updated and put them in a dictionary.. – Snowman Aug 27 '12 at 21:21
  • Can you please provide the code for how you obtain `friends`, what do you mean you obtain the friends from the client device? – someone1 Aug 28 '12 at 14:26
  • @someone1 the client device, an iPhone, sends an array of friends, and each item in the array is just a dictionary with that friend's email and last updated date. That's it. I just loop through each friend of that user, get their email, look up the key in the for loop, and get the updates for that user. I've just asked this question too http://stackoverflow.com/questions/12161482/how-do-i-make-this-python-function-asynchronous – Snowman Aug 28 '12 at 14:29
  • Reposting your question repeatedly on SO will not change things, please refrain from doing so. If this is how your data is modeled, please look at Proppy's answer for how to achieve concurrency within your code. Otherwise, wrapping this block of code within a tasklet will enable you to process other things as this gets processed. Does that help clarify things? – someone1 Aug 28 '12 at 14:36
  • @someone1 The problem is I'm getting more code back as answers rather than an explanation. I'm a human looking for words to understand that can help me make sense of code, not a machine looking for more code to help me understand more code. Everyone is just telling me how or what, no one is telling me _why_. – Snowman Aug 28 '12 at 14:39

2 Answers

4

Try looking at NDB's map function: https://developers.google.com/appengine/docs/python/ndb/queryclass#Query_map_async

Example (this assumes you keep your friend relationships in a separate model; here I used a hypothetical Relationships model):

@ndb.tasklet
def callback(entity):
  # `entity` is one Relationships entity. The property names `email` and
  # `lastUpdated` are assumed here; substitute whatever your model stores.
  userKey = Key('User', entity.email)
  query = ndb.gql('SELECT * FROM StatusUpdates WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, entity.lastUpdated)
  status_updates = yield query.fetch_async()
  raise ndb.Return(status_updates)

# The lines below use `yield`, so they must run inside a tasklet.
qry = ndb.gql("SELECT * FROM Relationships WHERE friend_to = :1", user.key)
updates = yield qry.map_async(callback)
# updates is now a list holding one list of status updates per relationship

Update:

With a better understanding of your data model:

queries = []
status_list = []
for friend_dic in friends:
  email = friend_dic['email']
  lastUpdated = friend_dic['lastUpdated']
  userKey = Key('User', email)
  queries.append(ndb.gql('SELECT * FROM StatusUpdates WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastUpdated).fetch_async())

for query in queries:
  statuses = yield query
  status_list.extend([x.to_dict() for x in statuses])

raise ndb.Return(status_list)
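
The same fan-out, fan-in shape can be tried outside App Engine. The sketch below is only a stand-in: it uses Python's `asyncio` instead of NDB, with `await` playing roughly the role that `yield` plays in a tasklet. `FAKE_UPDATES`, `fetch_updates`, and `collect_updates` are hypothetical names made up for this illustration, and the sleep merely simulates RPC latency.

```python
import asyncio

# Hypothetical in-memory stand-in for the StatusUpdates kind, keyed by email.
FAKE_UPDATES = {
    "a@example.com": [
        {"text": "a1", "modifiedDate": 1},
        {"text": "a2", "modifiedDate": 5},
    ],
    "b@example.com": [{"text": "b1", "modifiedDate": 3}],
}

async def fetch_updates(email, last_updated):
    # Simulate one query RPC; the sleep stands in for network latency.
    await asyncio.sleep(0.01)
    return [u for u in FAKE_UPDATES.get(email, [])
            if u["modifiedDate"] > last_updated]

async def collect_updates(friends):
    # Fan out: schedule every lookup before awaiting any result.
    tasks = [
        asyncio.create_task(fetch_updates(f["email"], f["lastUpdated"]))
        for f in friends
    ]
    # Fan in: await each task in order and flatten into one list.
    status_list = []
    for task in tasks:
        status_list.extend(await task)
    return status_list

friends = [
    {"email": "a@example.com", "lastUpdated": 2},
    {"email": "b@example.com", "lastUpdated": 0},
]
statuses = asyncio.run(collect_updates(friends))
```

The point of the two separate loops is that nothing waits until every lookup has been started; waiting inside the first loop would bring back the one-at-a-time behaviour of the original code.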
someone1
  • I like this...where is the Future object though? Say I wanted to run the query and then do other stuff..does calling updates = ... wait until all results are fetched? How do I begin the queries but do other stuff, and return to the results later? – Snowman Aug 27 '12 at 21:20
  • Actually wait..what's the point of two separate queries? I already have that user's friends in a dictionary, why query for them again? – Snowman Aug 27 '12 at 21:27
  • Have a look at my edited code...that's what I'm doing currently..can it get any more async than what I'm doing? – Snowman Aug 27 '12 at 21:33
  • IIRC, `map_async` returns a future. If you just attribute it to `updates` instead of yielding it you can continue doing whatever you want until you try to `get_result`. – rbanffy Aug 28 '12 at 02:25
  • @mohabitar Your current code isn't async at all. You call `has_next_async`, but then immediately call yield on it, which blocks; in the inner loop, you synchronously fetch results. To achieve asynchronicity you either need to write a tasklet or use NDB's map support as someone1 suggests. – Nick Johnson Aug 28 '12 at 09:13
  • @NickJohnson what if my code you see in my original post is wrapped in a tasklet? Does that make it async? – Snowman Aug 28 '12 at 13:07
  • I'm confused...why did Guido say my function was async then: http://stackoverflow.com/questions/12125342/creating-an-asynchronous-method-with-google-app-engines-ndb – Snowman Aug 28 '12 at 14:00
  • My suggestion is to perform the status update queries as a map function when querying the `friends` list. If your data model does not support this, then your revised code is still not optimal. The code you showed Guido was different than what you have here. Using that code as a tasklet will enable you to achieve concurrency outside of that process, but not within it. Look at proppy's answer for a generic example on how to achieve concurrency within the block of code you've shown. – someone1 Aug 28 '12 at 14:31
  • @someone1 so the code I showed Guido is async as a whole, but the inner loop is not async? Why is that? – Snowman Aug 28 '12 at 14:36
  • It may be too much to discuss over comments as to how tasklets work. If you call upon your tasklet and immediately call a `get_result()` on it, it will just run through the tasklet synchronously. If you call on your tasklet (without yielding) it will queue up RPCs to execute as a batch later on. The event loop within NDB will handle going through and batching additional requests as your code continues and RPCs are executed. It isn't until you need the results of your tasklet that NDB forces the event loop to finish processing the function. – someone1 Aug 28 '12 at 14:41
  • Your code shown above does not achieve concurrency as it executes a query and immediately starts yielding for results. Ideally, you would do a `fetch_async` on each iteration of `friends` and store the Future in a list, then go through that list after going through your `friends` list and merge the results. – someone1 Aug 28 '12 at 14:43
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/15908/discussion-between-mohabitar-and-someone1) – Snowman Aug 28 '12 at 14:44
1

You could perform those queries concurrently using ndb async methods:

from google.appengine.ext import ndb

class Bar(ndb.Model):
   pass

class Foo(ndb.Model):
   pass

bars = ndb.put_multi([Bar() for i in range(10)])
ndb.put_multi([Foo(parent=bar) for bar in bars])

futures = [Foo.query(ancestor=bar).fetch_async(10) for bar in bars]
for f in futures:
  print(f.get_result())

This launches 10 concurrent Datastore Query RPCs, and the overall latency depends only on the slowest one instead of the sum of all latencies.
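
To see the "slowest one versus the sum" effect without a Datastore, here is a rough stand-in using only the standard library. It is not NDB code: `fake_query` is a hypothetical function that sleeps to imitate RPC latency, and `ThreadPoolExecutor` futures play the part of the NDB futures, with `result()` corresponding loosely to `get_result()`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(latency):
    # Stand-in for one Datastore query RPC; sleeps instead of using the network.
    time.sleep(latency)
    return latency

latencies = [0.05, 0.1, 0.15, 0.1, 0.05]

# Serial: each call waits for the previous one, so the latencies add up.
start = time.perf_counter()
serial_results = [fake_query(l) for l in latencies]
serial_elapsed = time.perf_counter() - start

# Concurrent: start every call first, then collect the results in order.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
    futures = [pool.submit(fake_query, l) for l in latencies]
    concurrent_results = [f.result() for f in futures]
concurrent_elapsed = time.perf_counter() - start

print("serial: %.2fs, concurrent: %.2fs" % (serial_elapsed, concurrent_elapsed))
```

On an otherwise idle machine the serial figure should land near the sum of the simulated latencies and the concurrent figure near the largest single one, though exact timings will vary.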

Also see the official ndb documentation for more detail on how to use the async APIs with ndb.

proppy
  • Have a look at my edited code...is it possible to make it more async than what I'm doing or I'm up to the max? – Snowman Aug 27 '12 at 21:34