In a Python GAE application that I'm working on, we need to retrieve `n` rows from storage, and we are running into performance issues for `n > 100`. We expect `n` to be less than 10,000 in most cases.
So let's consider a simple model:
```python
class MyEntity(ndb.Model):
    field1 = ndb.StringProperty()
    field2 = ndb.StringProperty()
    # ...
    fieldm = ndb.StringProperty()
    # m is quite large, maybe ~30. Stored strings are short,
    # on the order of 30 characters or less.
```
I've populated the data store with some data, and got really bad performance using plain `fetch()`. I've since removed all filters, and just trying to get a number of entities gives very bad performance (compared with what I would expect from, say, any common SQL deployment; I know that we shouldn't compare GAE to SQL, but just getting flat rows down, I would expect GAE to be more performant, not less). Here's what I've tried:
- The simplest approach, `MyEntity.query().fetch(n)`, scales linearly with `n`, which is expected. Although I didn't expect it to take 7 s for `n = 1000`.
- Trying to coerce `fetch()` with any reasonable `batch_size` degrades performance further. I've tried values ranging from 1 to 1000.
- Doing `keys_only` gives an order-of-magnitude improvement.
- Doing a query manually (through `ndb.Query`), and getting out just a single field, gives a small improvement, on the order of 1.2x.
- Doing a `fetch_async(n)` and waiting gives exactly the same performance.
- Splitting the job into `p` parts, then doing `fetch_async(n/p, offset=...)`, and then waiting for and joining all the futures, gives at best the same performance, at worst much worse performance.
- Similar story with `fetch_page()`.
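For concreteness, the variants above look roughly like this (a sketch, not exact benchmark code; it assumes the `MyEntity` model from earlier and requires the App Engine SDK, and the timings in the comments are my own measurements, not documented guarantees):

```python
from google.appengine.ext import ndb

n = 1000

# 1. Plain fetch: scales linearly with n, ~7 s at n = 1000
entities = MyEntity.query().fetch(n)

# 2. Explicit batch size: degraded performance for every value I tried (1-1000)
entities = MyEntity.query().fetch(n, batch_size=500)

# 3. Keys only: roughly an order of magnitude faster
keys = MyEntity.query().fetch(n, keys_only=True)

# 4. Projection on a single field: ~1.2x improvement over plain fetch
vals = MyEntity.query().fetch(n, projection=[MyEntity.field1])

# 5. Async fetch: identical wall-clock time to the synchronous version
entities = MyEntity.query().fetch_async(n).get_result()
```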
I've also tried using `db` instead of `ndb`, and the results are pretty much the same.

So, now I'm not sure what to do. Is there a way to get half-decent performance for `n` on the order of 10,000? Even simplifying my entities to single fields, the performance is too poor. I expect the entire payload, uncompressed, to be roughly 1 MB. Downloading 1 MB in over a minute is clearly unacceptable.
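As an aside, the split-into-`p`-parts attempt mentioned above used straightforward offset arithmetic, roughly like this (plain Python, a hypothetical helper name; the actual `fetch_async(limit, offset=offset)` calls are GAE-specific):

```python
def split_offsets(n, p):
    """Split a fetch of n entities into p (limit, offset) chunks,
    suitable for issuing p parallel fetch_async(limit, offset=offset) calls."""
    base, rem = divmod(n, p)
    chunks = []
    offset = 0
    for i in range(p):
        # Spread the remainder over the first `rem` chunks
        limit = base + (1 if i < rem else 0)
        if limit:
            chunks.append((limit, offset))
        offset += limit
    return chunks

# Example: 10 entities in 3 parts
print(split_offsets(10, 3))  # [(4, 0), (3, 4), (3, 7)]
```

Each future is then waited on and the results concatenated in offset order; as noted above, this never beat a single `fetch(n)` in my tests.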
I am seeing this issue live, but for performance testing I'm using the remote API. My question is similar to this question on SO: Best practice to query large number of ndb entities from datastore. They didn't seem to find a solution, but it was asked 4 years ago; maybe there is one now.