
In a Python GAE application that I'm working on, we need to retrieve n rows from storage, and we are running into performance issues for n > 100. We expect n to be less than 10000 in most cases.

So let's consider a simple model:

from google.appengine.ext import ndb

class MyEntity(ndb.Model):
    field1 = ndb.StringProperty()
    field2 = ndb.StringProperty()
    # ...
    fieldm = ndb.StringProperty()
    # m is quite large, maybe ~30. Stored strings are short,
    # on the order of 30 characters or less.

I've populated the data store with some data and got really bad performance using a plain fetch(). I've since removed all filters; just retrieving a number of entities gives very bad performance compared with what I would expect from, say, any common SQL deployment. I know we shouldn't compare GAE to SQL, but just getting flat rows down, I would expect it to be more performant, not less. Here's what I've tried, with a rough timing sketch after the list:

  • The simplest approach, MyEntity.all().fetch(n). This scales linearly with n, which is expected, although I didn't expect it to take 7 s for n = 1000.
  • Trying to coerce fetch() with any reasonable batch_size degrades performance further. I've tried values ranging from 1 to 1000.
  • Doing keys_only gives an order-of-magnitude improvement.
  • Doing a query manually (through ndb.Query) and getting out just a single field gives a small improvement, on the order of 1.2x.
  • Doing fetch_async(n) and waiting gives exactly the same performance.
  • Splitting the job into p parts, doing fetch_async(n/p, offset=...) for each, and then waiting on and joining all the futures gives at best the same performance, at worst much worse performance.
  • Similar story with fetch_page().
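
Here's roughly how I'm timing these variants (a simplified sketch, not my actual benchmark code; the timed() helper and the n, p values are illustrative, and I'm using a projection for the single-field case):

import time
from google.appengine.ext import ndb

def timed(label, fn):
    start = time.time()
    result = fn()
    print '%s: %.2fs' % (label, time.time() - start)
    return result

n, p = 1000, 4
timed('plain fetch', lambda: MyEntity.query().fetch(n))
timed('batched fetch', lambda: MyEntity.query().fetch(n, batch_size=500))
timed('keys only', lambda: MyEntity.query().fetch(n, keys_only=True))
timed('single field', lambda: MyEntity.query(projection=[MyEntity.field1]).fetch(n))

def parallel_fetch():
    # offset-based splits still make the datastore walk the skipped rows,
    # which may be why this is sometimes slower rather than faster
    chunk = n // p
    futures = [MyEntity.query().fetch_async(chunk, offset=i * chunk)
               for i in range(p)]
    ndb.Future.wait_all(futures)
    return [e for f in futures for e in f.get_result()]

timed('parallel fetch_async', parallel_fetch)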

I've also tried using db instead of ndb, and the results are pretty much the same. So now I'm not sure what to do. Is there a way to get half-decent performance for n on the order of 10000? Even after simplifying my entities to a single field, the performance is too poor. I expect the entire payload, uncompressed, to be roughly 1 MB. Taking over a minute to download 1 MB is clearly unacceptable.

I am seeing this issue live, but for performance testing I'm using the remote API. My question is similar to this question on SO: Best practice to query large number of ndb entities from datastore. They didn't seem to find a solution, but it was asked four years ago; maybe there is one now.

Gleno

2 Answers


If you only need a subset of the fields on the model, then look into Projection Queries.
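
Something along these lines, using the model from the question (an untested sketch; n and the chosen fields are illustrative):

from google.appengine.ext import ndb

# only the projected properties come back; the query is served from the
# index instead of fetching whole entities, which is where the speedup is
entities = MyEntity.query(projection=[MyEntity.field1, MyEntity.field2]).fetch(1000)
values = [(e.field1, e.field2) for e in entities]

Note that only indexed properties can be projected (StringProperty is indexed by default).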

N.P
  • There's a performance benefit in projection queries, but I'm looking for a general solution that would boost performance of getting n ~ 10000 rows. – Gleno May 23 '16 at 21:33

You can get better performance depending on what you need to do with the retrieved data. For example, using the _async variants you mentioned (documentation here): retrieve a subset of the data, say the first 100 records, then call _async for the next subset of 100 while you process the first batch, since _async is non-blocking. When you finish processing, call .get_result() to retrieve the second batch's results and start processing its data while fetching the third batch with _async, and so on.
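
A minimal sketch of that overlap using fetch_page_async with cursors (the pipelined_fetch name, the process callback, and the batch size are illustrative):

from google.appengine.ext import ndb

def pipelined_fetch(process, batch_size=100):
    query = MyEntity.query()
    future = query.fetch_page_async(batch_size)
    while future is not None:
        results, cursor, more = future.get_result()
        # start the next RPC before processing the current batch, so the
        # datastore round trip overlaps with the processing work
        if more:
            future = query.fetch_page_async(batch_size, start_cursor=cursor)
        else:
            future = None
        process(results)

This keeps one RPC in flight while each batch is being processed.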

Khaled