SqlAlchemy mapped bulk update - make safer and faster?

Question

I'm using Postgres 9.2 and SqlAlchemy. Currently, this is my code to update the rankings of my Things in my database:

lock_things = session.query(Thing).\
                filter(Thing.group_id == 4).\
                with_for_update().all()

tups = RankThings(lock_things) # return sorted tuple (<numeric>, <primary key Thing id>)
rank = 1
for prediction, id in tups:
    thing = session.query(Thing).\
        filter(Thing.group_id == 4).\
        filter(Thing.id == id).one()
    thing.rank = rank
    rank += 1

session.commit()

However, this seems slow. It's also something I want to be atomic, which I why I use the with_for_update() syntax.

I feel like there must be a way to "zip" up the query and so an update in that way.

How can I make this faster and done all in one query?

EDIT: I think I need to create a temp table to join and make a fast update, see:

https://stackoverflow.com/a/20224370/712997

http://tapoueh.org/blog/2013/03/15-batch-update

Any ideas how to do this in SqlAlchemy?

What is `RankThings`? Is it query as well? General recommendation: How would your SQL query look? Then build it in SQLAlchemy, possibly using core constructs. — javex, Mar 11 '14 at 21:53
No. It just computes the rankings of the Thing objects based on their columns from the result found in `lock_things`. I added the `all()` command to show that I'm accessing the results of `lock_things`. — lollercoaster, Mar 11 '14 at 22:11

score 3 · Accepted Answer · answered Mar 12 '14 at 09:08

Generally speaking with such operations you aim for two things:

Do not execute a query inside a loop
Reduce the number of queries required by performing computations on the SQL side

Additionally, you might want to merge some of the queries you have, if possible.

Let's start with 2), because this is very specific and often not easily possible. Generally, the fastest operation here would be to write a single query that returns the rank. There are two options with this:

The query is quick to run so you just execute it whenever you need the ranking. This would be the very simple case of something like this:
```
SELECT 
    thing.*, 
    (POINTS_QUERY) as score
FROM thing
ORDER BY score DESC
```
In this case, this will give you an ordered list of things by some artificial score (e.g. if you build some kind of competition). The POINTS_QUERY would be something that uses a specific thing in a subquery to determine its score, e.g. aggregate the points of all the tasks it has solved.

In SQLAlchemy, this would look like this:
```
score = session.query(func.sum(task.points)).filter(task.thing_id == Thing.id).correlate(Thing).label("score")
thing_ranking = session.query(thing, score).order_by(desc("score"))
```
This is somewhat a little bit more advanced usage of SQLAlchemy: We construct a subquery that returns a scalar value we labled score. With correlate we tell it that thing will come from an outer query (this is important).

So that was the case where you run a single query that gives you a ranking (the ranks a determined based on the index in the list and depend on your ranking strategy). If you can achieve this, it is the best case
The query itself is expensive you want the values cached. This means you can either use the solution above and cache the values outside of the database (e.g. in a dict or using a caching library). Or you compute them like above but update a database field (like Thing.rank). Again, the query from above gives us the ranking. Additionally, I assume the simplest kind of ranking: the index denotes the rank:
```
for rank, (thing, score) in enumerate(thing_ranking):
     thing.rank = rank
```
Notice how I base my rank based on the index using enumerate. Additionally, I take advantage of the fact that since I just queried thing, I already have it in the session, so no need for an extra query. So this might be your solution right here, but read on for some additional info.

Using the last idea from above, we can now tackle 1): Get the query outside the loop. In general I noticed that you pass a list of things to a sorting function that only seems to return IDs. Why? If you can change it, make it so that it returns the things as a whole.

However, it might be possible that you cannot change this function so let's consider what we do if we can't change it. We already have a list of all relevant things. And we get a sorted list of their IDs. So why not build a dict as a lookup for ID -> Thing?

things_dict = dict(thing.id, thing for thing in lock_things)

We can use this dict instead of querying inside the loop:

for prediction, id in tups:
    thing = things_dict[id]

However, it may be possible (for some reason I missed in your example) that not all IDs were returned previously. In that case (or in general) you can take advantage of a similar mapping SQLAlchemy keeps itself: You can ask it for a primary key and it will not query the database if it already has it:

for prediction, id in tups:
    thing = session.query(Thing).get(id)

So that way we have reduced the problem and only execute queries for objects we don't already have.

One last thing: What if we don't have most of the things? Then I didn't solve your problem, I just replaced the query. In that case, you will have to create a new query that fetches all the elements you need. In general this depends on the source of the IDs and how they are determined, but you could always go the least efficient way (which is still way faster than inside-loop queries): Using SQL's IN:

all_things = session.query(Thing).filter(Thing.group_id == 4).filter(Thing.id.in_([id for _, id in tups]).all()

This would construct a query that filters with the IN keyword. However, with a large list of things this is terribly inefficient and thus if you are in this case, it is most likely better you construct some more efficient way in SQL that determines if this is an ID you want.

Summary

So this was a long text. So sum up:

Perform queries in SQL as much as possible if you can write it efficiently there
Use SQLAlchemy's awesomeness to your advantage, e.g. create subqueries
Try to never execute queries inside a loop
Create some mappings for yourself (or use that of SQLAlchemy to your advantage)
Do it the pythonic way: Keep it simple, keep it explicit.

One final thought: If your queries get really complex and you fear you loose control over the queries executed by the ORM, drop it and use the Core instead. It is almost as awesome as the ORM and gives you huge amounts of control over the queries as you build them yourselves. With this you can construct almost any SQL query you can think of and I am certain that the batch updates you mentioned are also possible here (If you see that my queries above lead to many UPDATE statements, you might want to use the Core).

SqlAlchemy mapped bulk update - make safer and faster?

1 Answers1

Summary