Generally speaking with such operations you aim for two things:
Do not execute a query inside a loop
Reduce the number of queries required by performing computations on the SQL side
Additionally, you might want to merge some of the queries you have, if possible.
Let's start with 2), because this is very specific and often not easily possible. Generally, the fastest operation here would be to write a single query that returns the rank. There are two options with this:
The query is quick to run so you just execute it whenever you need the ranking. This would be the very simple case of something like this:
SELECT
thing.*,
(POINTS_QUERY) as score
FROM thing
ORDER BY score DESC
In this case, this will give you an ordered list of things by some artificial score (e.g. if you build some kind of competition). The POINTS_QUERY
would be something that uses a specific thing
in a subquery to determine its score, e.g. aggregate the points of all the tasks it has solved.
In SQLAlchemy, this would look like this:
score = session.query(func.sum(task.points)).filter(task.thing_id == Thing.id).correlate(Thing).label("score")
thing_ranking = session.query(thing, score).order_by(desc("score"))
This is somewhat a little bit more advanced usage of SQLAlchemy: We construct a subquery that returns a scalar value we labled score
. With correlate
we tell it that thing
will come from an outer query (this is important).
So that was the case where you run a single query that gives you a ranking (the ranks a determined based on the index in the list and depend on your ranking strategy). If you can achieve this, it is the best case
The query itself is expensive you want the values cached. This means you can either use the solution above and cache the values outside of the database (e.g. in a dict or using a caching library). Or you compute them like above but update a database field (like Thing.rank
). Again, the query from above gives us the ranking. Additionally, I assume the simplest kind of ranking: the index denotes the rank:
for rank, (thing, score) in enumerate(thing_ranking):
thing.rank = rank
Notice how I base my rank based on the index using enumerate
. Additionally, I take advantage of the fact that since I just queried thing
, I already have it in the session, so no need for an extra query. So this might be your solution right here, but read on for some additional info.
Using the last idea from above, we can now tackle 1): Get the query outside the loop. In general I noticed that you pass a list of things to a sorting function that only seems to return IDs. Why? If you can change it, make it so that it returns the things as a whole.
However, it might be possible that you cannot change this function so let's consider what we do if we can't change it. We already have a list of all relevant things. And we get a sorted list of their IDs. So why not build a dict
as a lookup for ID -> Thing?
things_dict = dict(thing.id, thing for thing in lock_things)
We can use this dict instead of querying inside the loop:
for prediction, id in tups:
thing = things_dict[id]
However, it may be possible (for some reason I missed in your example) that not all IDs were returned previously. In that case (or in general) you can take advantage of a similar mapping SQLAlchemy keeps itself: You can ask it for a primary key and it will not query the database if it already has it:
for prediction, id in tups:
thing = session.query(Thing).get(id)
So that way we have reduced the problem and only execute queries for objects we don't already have.
One last thing: What if we don't have most of the things? Then I didn't solve your problem, I just replaced the query. In that case, you will have to create a new query that fetches all the elements you need. In general this depends on the source of the IDs and how they are determined, but you could always go the least efficient way (which is still way faster than inside-loop queries): Using SQL's IN
:
all_things = session.query(Thing).filter(Thing.group_id == 4).filter(Thing.id.in_([id for _, id in tups]).all()
This would construct a query that filters with the IN
keyword. However, with a large list of things this is terribly inefficient and thus if you are in this case, it is most likely better you construct some more efficient way in SQL that determines if this is an ID you want.
Summary
So this was a long text. So sum up:
Perform queries in SQL as much as possible if you can write it efficiently there
Use SQLAlchemy's awesomeness to your advantage, e.g. create subqueries
Try to never execute queries inside a loop
Create some mappings for yourself (or use that of SQLAlchemy to your advantage)
Do it the pythonic way: Keep it simple, keep it explicit.
One final thought: If your queries get really complex and you fear you loose control over the queries executed by the ORM, drop it and use the Core instead. It is almost as awesome as the ORM and gives you huge amounts of control over the queries as you build them yourselves. With this you can construct almost any SQL query you can think of and I am certain that the batch updates you mentioned are also possible here (If you see that my queries above lead to many UPDATE
statements, you might want to use the Core).