
What is the proper way to perform mass updates on entities in a Google App Engine Datastore? Can it be done without having to retrieve the entities?

For example, what would be the GAE equivalent of something like this in SQL:

UPDATE dbo.authors
SET    city = replace(city, 'Salt', 'Olympic')
WHERE  city LIKE 'Salt%';
Yarin

3 Answers


There isn't a direct translation. The datastore really has no concept of updates; all you can do is overwrite an old entity with a new entity at the same address (key). To change an entity, you must fetch it from the datastore, modify it locally, and then save it back.
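For illustration, here's a minimal sketch of that read-modify-write cycle with the Python db API. The Author model and key name are assumptions, not anything from the question:

from google.appengine.ext import db

class Author(db.Model):
  city = db.StringProperty()

# Fetch the entity, change it locally, then overwrite it at the same key.
author = Author.get_by_key_name('some-author')  # hypothetical key name
author.city = author.city.replace('Salt', 'Olympic')
author.put()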

There's also no equivalent to the LIKE operator. A trailing-wildcard match like 'Salt%' is possible with some tricks, but to match '%Salt%' you'd have to read every single entity into memory and do the string comparison locally.
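For reference, the usual trick for the 'Salt%' case is a pair of inequality filters that bound the prefix range; a sketch against the same hypothetical Author model:

# u'\ufffd' is the highest legal unicode character, so this pair of
# filters selects exactly the city values that start with 'Salt'.
query = Author.all().filter('city >=', u'Salt').filter('city <', u'Salt' + u'\ufffd')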

So it's not going to be quite as clean or efficient as SQL. This is a tradeoff with most distributed object stores, and the datastore is no exception.

That said, the mapper library is available to facilitate such batch updates. Follow the example in the mapper documentation and use something like this for your process function:

from mapreduce import operation as op

def process(entity):
  # Only rewrite entities whose city starts with 'Salt'.
  if entity.city.startswith('Salt'):
    entity.city = entity.city.replace('Salt', 'Olympic')
    yield op.db.Put(entity)  # the mapper batches these puts for you

There are other alternatives besides the mapper. The most important optimization tip is to batch your updates; don't save back each updated entity individually. If you use the mapper and yield puts, this is handled automatically.
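If you roll your own loop instead of using the mapper, a batched write looks something like this (again a sketch against the hypothetical Author model from above):

from google.appengine.ext import db

updated = []
for author in Author.all().filter('city >=', u'Salt').filter('city <', u'Salt' + u'\ufffd'):
  author.city = author.city.replace('Salt', 'Olympic')
  updated.append(author)

db.put(updated)  # one batched RPC instead of one put() per entity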

Drew Sears
  • Drew- thanks very much for the mapper reference- looks like something I'm going to want to learn about. – Yarin Nov 19 '11 at 22:21

No, it can't be done without retrieving the entities.

There's no such thing as a '1000 max record limit', but there is of course a timeout on any single request - and if you have a large number of entities to modify, a simple iteration will probably fall foul of that. You could manage this by splitting the work into multiple operations and keeping track of your position with a query cursor, or potentially by using the MapReduce framework, as sketched below.
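A sketch of the cursor approach, chaining chunks of work through the task queue. The Author model, the handler URL, and the batch size are all assumptions for illustration:

from google.appengine.api import taskqueue
from google.appengine.ext import db

BATCH_SIZE = 100

def update_batch(cursor=None):
  query = Author.all().filter('city >=', u'Salt').filter('city <', u'Salt' + u'\ufffd')
  if cursor:
    query.with_cursor(cursor)  # resume where the previous request stopped
  authors = query.fetch(BATCH_SIZE)
  for author in authors:
    author.city = author.city.replace('Salt', 'Olympic')
  db.put(authors)  # one batched write per chunk
  if len(authors) == BATCH_SIZE:
    # Probably more to do: enqueue a task to process the next chunk.
    taskqueue.add(url='/update-authors', params={'cursor': query.cursor()})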

Daniel Roseman
  • Daniel, thanks - could have sworn there was a max record limit at one point - did they get rid of it? – Yarin Nov 20 '11 at 19:07
  • In my case, I did not get a "timeout" per se, I got the specific error: `While handling this request, the process that handled this request was found to be using too much memory and was terminated.` – Nate Anderson Oct 18 '15 at 20:10
  • Allow me to emphasize that MapReduce is designed to solve at least "both" the problems of "too much memory" (aka **large**) and "timeout" (aka **long-running**): [`It is useful for large, long-running jobs that cannot be handled within the scope of a single request, tasks like:`](https://github.com/GoogleCloudPlatform/appengine-mapreduce/wiki/1-MapReduce) – Nate Anderson Oct 18 '15 at 20:12

You could use the Query class: http://code.google.com/appengine/docs/python/datastore/queryclass.html

query = authors.all().filter('city >=', u'Salt').filter('city <', u'Salt' + u'\ufffd')
for record in query:
  record.city = record.city.replace('Salt', 'Olympic')
  record.put()  # write the modified entity back
Adithya Surampudi
    thanks, but isn't this bringing all objects into memory, and also constrained by the 1000 max record limits? – Yarin Nov 19 '11 at 21:16