2

The current value of an Entity's status attribute can be queried as the latest entry in an EntityHistory table for that entity, i.e.

Entities (id) <- EntityHistory (timestamp, entity_id, value)

How do I write an efficient SQLAlchemy expression that eagerly loads the current value from the history table for all entities without causing N+1 queries?

I tried writing a property on my model, but iterating over the entities issues one query per row (N+1). As far as I know this cannot be solved without a subquery, which still seems inefficient on the database side.

Example EntityHistory data:

timestamp |entity_id| value
==========|=========|======
     15:00|        1|     x
     15:01|        1|     y
     15:02|        2|     x
     15:03|        2|     y
     15:04|        1|     z

So the current value for entity 1 would be z and for entity 2 it would be y. The backing database is Postgres.
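For reference, the kind of property I mean looks roughly like this (a minimal sketch with my own model names, integer stand-in timestamps, and SQLite instead of Postgres):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, object_session

Base = declarative_base()

class Entities(Base):
    __tablename__ = 'entities'
    id = Column(Integer, primary_key=True)

    @property
    def value(self):
        # One extra SELECT per instance -> N+1 when iterating over entities.
        return (
            object_session(self)
            .query(EntityHistory.value)
            .filter(EntityHistory.entity_id == self.id)
            .order_by(EntityHistory.timestamp.desc())
            .limit(1)
            .scalar()
        )

class EntityHistory(Base):
    __tablename__ = 'entity_history'
    id = Column(Integer, primary_key=True)
    entity_id = Column(Integer, ForeignKey('entities.id'))
    timestamp = Column(Integer)  # integer stand-in for a real timestamp
    value = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = Session(engine)
session.add(Entities(id=1))
session.add_all([
    EntityHistory(entity_id=1, timestamp=1500, value='x'),
    EntityHistory(entity_id=1, timestamp=1504, value='z'),
])
session.commit()

current = session.get(Entities, 1).value  # emits its own SELECT
```

Each access of `.value` emits its own SELECT, so loading N entities and reading the property costs N extra queries.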

Petrus Theron
  • Would a correlated subquery as a [`column_property()`](http://docs.sqlalchemy.org/en/latest/orm/mapped_sql_expr.html#using-column-property) work for you? It's loaded during instance load time, not when you access it. You'd have 1 query, which includes a correlated subquery, but that should be ok with proper indexes in place etc, unless you have ridiculous amounts of data or some such. – Ilja Everilä Jun 18 '16 at 21:14

1 Answer

6

I think you could use a column_property to load the latest value as an attribute of an Entities instance alongside the other column-mapped attributes:

from sqlalchemy import select
from sqlalchemy.orm import column_property

class Entities(Base):

    ...

    value = column_property(
        select([EntityHistory.value]).
        where(EntityHistory.entity_id == id).  # the id column from before
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        correlate_except(EntityHistory)
    )

A subquery could of course also be used in a query instead of a column_property.

query = session.query(
    Entities,
    session.query(EntityHistory.value).
        filter(EntityHistory.entity_id == Entities.id).
        order_by(EntityHistory.timestamp.desc()).
        limit(1).
        label('value')
)                  

Performance will naturally depend on a proper index being in place:

Index('entityhistory_entity_id_timestamp_idx',
      EntityHistory.entity_id,
      EntityHistory.timestamp.desc())

In a way this is still your dreaded N+1, since the query runs a subquery per row, but it is hidden in a single round trip to the DB.
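Put together, a minimal runnable sketch of the column_property approach (using SQLite as a stand-in for Postgres, integer stand-in timestamps, and the newer `select()` / `scalar_subquery()` calling style; model names otherwise as above):

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine, select
from sqlalchemy.orm import Session, column_property, declarative_base

Base = declarative_base()

class EntityHistory(Base):
    __tablename__ = 'entity_history'
    id = Column(Integer, primary_key=True)
    entity_id = Column(Integer, ForeignKey('entities.id'))
    timestamp = Column(Integer)  # integer stand-in for a real timestamp
    value = Column(String)

class Entities(Base):
    __tablename__ = 'entities'
    id = Column(Integer, primary_key=True)
    # Correlated scalar subquery, evaluated as part of the entity SELECT.
    value = column_property(
        select(EntityHistory.value)
        .where(EntityHistory.entity_id == id)
        .order_by(EntityHistory.timestamp.desc())
        .limit(1)
        .correlate_except(EntityHistory)
        .scalar_subquery()
    )

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = Session(engine)
session.add_all([Entities(id=1), Entities(id=2)])
session.add_all([
    EntityHistory(entity_id=1, timestamp=1500, value='x'),
    EntityHistory(entity_id=1, timestamp=1501, value='y'),
    EntityHistory(entity_id=2, timestamp=1502, value='x'),
    EntityHistory(entity_id=2, timestamp=1503, value='y'),
    EntityHistory(entity_id=1, timestamp=1504, value='z'),
])
session.commit()

# One query loads all entities with their current values.
current = {e.id: e.value for e in session.query(Entities)}
```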

If, on the other hand, having value as a property of Entities is not necessary, then in Postgresql you can join against a DISTINCT ON ... ORDER BY query to fetch the latest values:

# The same index from before speeds this up.
# Remember nullslast(), if timestamp can be NULL.
values = session.query(EntityHistory.entity_id,
                       EntityHistory.value).\
    distinct(EntityHistory.entity_id).\
    order_by(EntityHistory.entity_id, EntityHistory.timestamp.desc()).\
    subquery()

query = session.query(Entities, values.c.value).\
    join(values, values.c.entity_id == Entities.id)

though in limited testing with dummy data the subquery-as-output-column always beat the join by a notable margin when every entity had history values. On the other hand, with millions of entities and a lot of missing history values, a LEFT JOIN was faster. I'd recommend testing against your own data to see which query suits it better. For random access to a single entity, given that the index is in place, the correlated subquery is faster; for bulk fetches: test.

Ilja Everilä
  • Maybe joining against an aggregate of `MAX(timestamp)` and grouped by entity_id could work? http://stackoverflow.com/a/625171/198927 – Petrus Theron Jun 19 '16 at 11:32
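That aggregate join could be sketched like this, against the same models (a SQLite stand-in with integer timestamps; it assumes timestamps are non-NULL and unique per entity, since ties would produce duplicate rows):

```python
from sqlalchemy import (Column, ForeignKey, Integer, String, and_,
                        create_engine, func)
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Entities(Base):
    __tablename__ = 'entities'
    id = Column(Integer, primary_key=True)

class EntityHistory(Base):
    __tablename__ = 'entity_history'
    id = Column(Integer, primary_key=True)
    entity_id = Column(Integer, ForeignKey('entities.id'))
    timestamp = Column(Integer)  # integer stand-in for a real timestamp
    value = Column(String)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)
session = Session(engine)
session.add_all([Entities(id=1), Entities(id=2)])
session.add_all([
    EntityHistory(entity_id=1, timestamp=1500, value='x'),
    EntityHistory(entity_id=1, timestamp=1501, value='y'),
    EntityHistory(entity_id=2, timestamp=1502, value='x'),
    EntityHistory(entity_id=2, timestamp=1503, value='y'),
    EntityHistory(entity_id=1, timestamp=1504, value='z'),
])
session.commit()

# Latest timestamp per entity, then join back to pick up the value.
latest = session.query(
        EntityHistory.entity_id.label('entity_id'),
        func.max(EntityHistory.timestamp).label('timestamp')).\
    group_by(EntityHistory.entity_id).\
    subquery()

query = session.query(Entities, EntityHistory.value).\
    join(latest, latest.c.entity_id == Entities.id).\
    join(EntityHistory, and_(
        EntityHistory.entity_id == latest.c.entity_id,
        EntityHistory.timestamp == latest.c.timestamp))

current = {entity.id: value for entity, value in query}
```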