GAE: Aggregate work results from tasks? (for a GAE query performance issue)

Question

Which of these options is best for fanning out work on GAE (to be completed within a request timeframe)?

Use tasks, store the results in memcache, poll memcache periodically from the request, and hope the tasks complete in time.
Use urlfetch to get the results of tasks; error handling and security will be a pain, though.
Use backend instances? (seems insane)
Or a Java instance (seems totally insane)

Background: It's ridiculous to even have to do this. I need to deliver 10k datastore items as JSON. Apparently the issue is that Python takes a lot of time to process the datastore results (Java seems much faster). This is well covered: 25796142, 11509368 and 21941954

Approach: As there is nothing to optimize on the software side (can't rewrite GAE), the approach would be to fan the work out over multiple instances and aggregate the results.

Querying keys only and getting query cursors for chunks of 2k items performs reasonably well, and then tasks could be spun off to fetch the results in 2k chunks. The question is how best to aggregate the results.
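To make the first option concrete, here is a minimal sketch of the fan-out and polling pattern in plain Python. It is not GAE code: threads stand in for task queue tasks, a lock-guarded dict stands in for memcache, and `fetch_chunk`, `worker`, `aggregate`, `handle_request` and the `job-1` key prefix are all hypothetical names made up for illustration. On GAE the worker would fetch entities from a query cursor instead of generating dummy items.

```python
import json
import threading
import time

# Assumption: a plain dict guarded by a lock stands in for memcache.
_cache = {}
_cache_lock = threading.Lock()


def fetch_chunk(start, end):
    """Hypothetical stand-in for fetching one chunk of datastore items."""
    return [{"id": i} for i in range(start, end)]


def worker(job_id, chunk_index, start, end):
    """Runs where a task would run; stores its chunk under a per-chunk key."""
    items = fetch_chunk(start, end)
    with _cache_lock:
        _cache["%s:%d" % (job_id, chunk_index)] = items


def aggregate(job_id, num_chunks, deadline_seconds, poll_interval=0.01):
    """Poll the cache until every chunk is present or the deadline passes."""
    deadline = time.monotonic() + deadline_seconds
    keys = ["%s:%d" % (job_id, i) for i in range(num_chunks)]
    while True:
        with _cache_lock:
            found = {k: _cache[k] for k in keys if k in _cache}
        if len(found) == num_chunks:
            # Concatenate in chunk order so the result is deterministic.
            return [item for k in keys for item in found[k]]
        if time.monotonic() >= deadline:
            return None  # caller decides how to handle a timeout
        time.sleep(poll_interval)


def handle_request(total_items=10000, chunk_size=2000, deadline_seconds=5.0):
    """Fan out one worker per chunk, then aggregate the results as JSON."""
    job_id = "job-1"  # hypothetical; a real request would use a unique id
    num_chunks = (total_items + chunk_size - 1) // chunk_size
    for i in range(num_chunks):
        start = i * chunk_size
        end = min(start + chunk_size, total_items)
        threading.Thread(target=worker, args=(job_id, i, start, end)).start()
    items = aggregate(job_id, num_chunks, deadline_seconds)
    if items is None:
        raise RuntimeError("not all chunks arrived before the deadline")
    return json.dumps(items)
```

The per-chunk keys let the request tell exactly which chunks are missing, and the explicit deadline replaces "hope the tasks complete in time" with a defined failure path. Note that memcache entries can be evicted at any time, so a real implementation would need a fallback for chunks that never show up.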

score 1 · Answer 1 · answered Nov 05 '14 at 16:12

It is not "ridiculous" to have to do this: it is an accepted consequence of the scalability offered by GAE. If you don't like the tradeoffs made to enable that scalability, you should choose another platform.

It's also unclear why you think using backend instances is "insane". Using Java would indeed be strange, but only because there's no reason to think it would perform any better.

However, there is a perfectly good way to do this which does not involve any of the hacks you mention, and that is to use the mapreduce framework, which is expressly made for collecting large quantities of data.

Daniel Roseman

Well, it's the only performance issue on GAE I could not fix so far; I'm well aware of the tradeoffs. I find it ridiculous because it's an issue with the GAE Python framework, not with the overall scalability of the platform. As it's orders of magnitude faster on Java, there are always ways to optimize it on Python as well, if needed by writing the bottleneck code in C. I'd love to see an example using MapReduce. – thomasf1 Nov 05 '14 at 18:44
PS: I did not want to offend anyone; I'm a bit frustrated with this one. Sorry! It took me quite a while to even realize what the problem was, as appstats shows it as RPC time... But parallelizing queries, or anything else that assumed the problem was Datastore performance, didn't help. It seems to be the GAE Python code unpacking the RPC response from the Datastore, which only became really clear after testing it in Java... – thomasf1 Nov 05 '14 at 19:02
