I hired a programmer to port my web site -- originally implemented using Django and MySQL -- over to Google App Engine. The database for the original web app is about 2 GB in size, and the largest table has 5 million rows. To port these contents over, as I understand it, the programmer is serializing the database to JSON and then uploading it to Google App Engine.
So far his uploading has used 100 hours of CPU time, as billed by GAE, yet it looks like only about 50 or 100 MB has been loaded into the database. Is that a reasonable amount of CPU time for such a small amount of data? MySQL could load this much data in a few minutes, so I don't understand why GAE would be 1000x slower. Is he doing something inefficiently?

- Bear in mind the dashboard is pretty delayed at reporting the state of the datastore. Mention to your developer that if he is `putting` each row/entity one at a time, he should consider batching the `puts` together to save CPU and other resources. – Chris Farmiloe Jul 14 '11 at 17:37
- It depends more on the number of rows than the actual size of the data, as GAE charges CPU time for each `Put()` call. – Roman Dolgiy Jul 15 '11 at 19:01
- @chris-farmiloe Will `db.put([instance1, instance2])` use less CPU time than `db.put(instance1); db.put(instance2)`? – Roman Dolgiy Jul 15 '11 at 19:10
- Since it is probably happening over `remote_api`, yes: it's two entire requests vs. one request (see the sketch below). – Chris Farmiloe Jul 15 '11 at 20:39
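To make the batching point in these comments concrete, here is a minimal sketch against the Python SDK's `db` API; the `Row` kind and its single property are placeholders, not the asker's actual model:

```python
from google.appengine.ext import db


class Row(db.Model):
    # Placeholder kind standing in for one imported table.
    payload = db.TextProperty()


rows = [Row(payload=p) for p in (u'a', u'b', u'c')]

# One put per entity: one datastore RPC per row (and, over remote_api,
# one HTTP round trip per row).
for row in rows:
    db.put(row)

# Batched: db.put() also accepts a list and writes all the entities in
# a single call, which is what the comments above recommend.
db.put(rows)
```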
3 Answers
That seems high, and it's likely they're making the server do a lot of work (decoding the JSON, encoding and storing the entities) that could be done on the client. There's already a bulkloader provided with the SDK, and if that isn't suitable for some reason, remote_api, on which the bulkloader is based, provides a more efficient option than rolling your own.
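As a rough illustration of the client-side approach, here is a hedged sketch of a local import script that parses the JSON dump on your own machine and pushes entities through remote_api in batches. The `Row` kind, the `dump.json` filename, the hostname, the endpoint path, and the record layout are all assumptions standing in for the real ones:

```python
import getpass
import json

from google.appengine.ext import db
from google.appengine.ext.remote_api import remote_api_stub


class Row(db.Model):
    # Placeholder kind; mirror the real Django tables here.
    payload = db.TextProperty()


def auth_func():
    # remote_api prompts for credentials once per run.
    return raw_input('Email: '), getpass.getpass('Password: ')


# Point the local db API at the deployed app's remote_api endpoint.
# Hostname and path are placeholders for the real application's values.
remote_api_stub.ConfigureRemoteApi(None, '/_ah/remote_api', auth_func,
                                   'your-app-id.appspot.com')

# All JSON parsing and entity construction happens locally, so the app
# is only billed for the datastore writes themselves.
records = json.load(open('dump.json'))
batch = []
for record in records:
    batch.append(Row(payload=record['payload']))
    if len(batch) == 500:  # 500 entities is the datastore's batch limit
        db.put(batch)
        batch = []
if batch:
    db.put(batch)
```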

- I'm looking at his code now. The only App Engine import in his `views.py` file is `from google.appengine.api import taskqueue`. It looks like he's uploading chunks of data via HTTP POST commands. Is this the slow way to do it? – Jeff Jul 15 '11 at 17:47
- @user793956 It's ultimately the only way to get data into the cloud, but uploading it in JSON format and doing the processing on the cloud is a waste of billed resources. The built-in bulkloader does the processing on the client, and sends the already-created protocol buffers to remote_api. – Nick Johnson Jul 15 '11 at 22:58
I have bulk loaded a GB of data; however, I wrote my own bulk load module (based on the interfaces they defined), and it took 25 hours of CPU time.
For more info, you could take a look at App Engine Bulk Loader Performance.
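For reference, the SDK-defined interface this answer alludes to is (as I understand it) a `bulkloader.Loader` subclass that maps input columns to datastore properties; the `Row` kind and its columns below are illustrative only:

```python
from google.appengine.ext import db
from google.appengine.tools import bulkloader


class Row(db.Model):
    # Illustrative kind; one such model per imported table.
    name = db.StringProperty()
    value = db.TextProperty()


class RowLoader(bulkloader.Loader):
    def __init__(self):
        # Each tuple maps an input column to a property and a converter.
        bulkloader.Loader.__init__(self, 'Row',
                                   [('name', str),
                                    ('value', lambda s: s.decode('utf-8'))])


# appcfg.py upload_data discovers loaders through this module-level list.
loaders = [RowLoader]
```

The upload itself would then be driven by something like `appcfg.py upload_data --config_file=row_loader.py --filename=rows.csv --kind=Row --url=http://your-app-id.appspot.com/remote_api` (flags quoted from memory of the 2011-era SDK, so double-check them against the docs).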
That depends a great deal on how he's serializing the data. I STRONGLY suspect that he's doing something inefficient, as yes, that's ludicrous for that amount of data. The inefficiency probably lies in the transfer time and the start/stop time for each query. If he's serializing each row and posting it to a handler one at a time, then I could totally understand it both taking forever and consuming a lot of CPU time.
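By way of contrast, a task-queue import along the lines being discussed could write each JSON chunk with a single batched put rather than row by row; this is only a sketch, and the handler, `Row` kind, and record layout are invented for illustration:

```python
import json

from google.appengine.ext import db
from google.appengine.ext import webapp


class Row(db.Model):
    # Invented kind standing in for one imported table.
    payload = db.TextProperty()


class ImportChunkHandler(webapp.RequestHandler):
    """Receives one JSON chunk (e.g. 100 rows) enqueued via taskqueue."""

    def post(self):
        records = json.loads(self.request.body)
        # Build every entity first, then write them with one batched
        # db.put() instead of issuing one RPC per row.
        db.put([Row(payload=r['payload']) for r in records])
```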

- I mentioned this to the programmer. He writes: "All data is spliced on 100 rows per json file, each task in task queue loads such chunk. I have found such amount of rows in file most optimal. As I said, we do not consume CPU time to download data and deserialize, only to save to the datastore." – Jeff Jul 15 '11 at 17:22