48

(New SQLAlchemy user alert) I have three tables: a person, the persons hourly rate starting at a specific date, and daily time reporting. I am looking for the correct way to have the cost for a Time base off of the persons hourly rate on that day.

Yes, I could compute the value at the time of creation and have that as part of the model, but think of this as an example of summarizing more complex data behind the curtain. How do I compute the Time.cost? Is it a hybrid_propery, a column_property or something completely different?

class Person(Base):
    __tablename__ = 'person'
    personID = Column(Integer, primary_key=True)
    name = Column(String(30), unique=True)

class Payrate(Base):
    __tablename__ = 'payrate'
    payrateID = Column(Integer, primary_key=True)
    personID  = Column(Integer, ForeignKey('person.personID'))
    hourly    = Column(Integer)
    starting  = Column(Date)
    __tableargs__ =(UniqueConstraint('personID', 'starting',
                                     name='uc_peron_starting'))

class Time(Base):
    __tablename__ = 'entry'
    entryID  = Column(Integer, primary_key=True)
    personID = Column(Integer, ForeignKey('person.personID'))
    workedon = Column(Date)
    hours    = Column(Integer)

    person = relationship("Person")

    def __repr__(self):
        return "<{date} {hours}hrs ${0.cost:.02f}>".format(self, 
                      date=self.workedon.isoformat(), hours=to_hours(self.hours))

    @property
    def cost(self):
        '''Cost of entry
        '''
        ## This is where I am stuck in propery query creation
        return self.hours * query(Payrate).filter(
                             and_(Payrate.personID==personID,
                                  Payrate.starting<=workedon
                             ).order_by(
                               Payrate.starting.desc())
Ilja Everilä
  • 50,538
  • 7
  • 126
  • 127
Frustrated
  • 772
  • 1
  • 8
  • 16

2 Answers2

127

The problem you have here, to solve as elegantly as possible, uses very advanced SQLAlchemy techniques, so I know you're a beginner, but this answer is going to show you all the way out to the end. However, solving a problem like this requires walking through one step at a time, and you can get the answer you want in different ways as we go through.

Before you get into how to hybrid this or whatever, you need to think about the SQL. How can we query for Time.cost over an arbitrary series of rows? We can link Time to Person cleanly because we have a simple foreign key. But to link Time to Payrate, with this particular schema is tricky, because Time links to Payrate not just via person_id but also via workedon - in SQL we'd join to this most easily using "time.person_id = person.id AND time.workedon BETWEEN payrate.start_date AND payrate.end_date". But you don't have an "end_date" here, which means we have to derive that also. That derivation is the trickiest part, so what I came up with starts like this (I've lowercased your column names):

SELECT payrate.person_id, payrate.hourly, payrate.starting, ending.ending
FROM payrate LEFT OUTER JOIN
(SELECT pa1.payrate_id, MIN(pa2.starting) as ending FROM payrate AS pa1
JOIN payrate AS pa2 ON pa1.person_id = pa2.person_id AND pa2.starting > pa1.starting
GROUP BY pa1.payrate_id
) AS ending ON payrate.payrate_id=ending.payrate_id

There might be other ways to get this, but that's what I came up with - other ways would almost certainly have some similar kind of thing going on (i.e. subqueries, joins).

So with a payrate starting/ending, we can figure out what a query would look like. We want to use BETWEEN to match a time entry to the date range, but the latest payrate entry will have NULL for the "end" date, so one way to work around that is to use COALESCE against a very high date (the other is to use conditionals):

SELECT *, entry.hours * payrate_derived.hourly
FROM entry
JOIN
    (SELECT payrate.person_id, payrate.hourly, payrate.starting, ending.ending
    FROM payrate LEFT OUTER JOIN
    (SELECT pa1.payrate_id, MIN(pa2.starting) as ending FROM payrate AS pa1
    JOIN payrate AS pa2 ON pa1.person_id = pa2.person_id AND pa2.starting > pa1.starting
    GROUP BY pa1.payrate_id
    ) AS ending ON payrate.payrate_id=ending.payrate_id) as payrate_derived
ON entry.workedon BETWEEN payrate_derived.starting AND COALESCE(payrate_derived.ending, "9999-12-31")
AND entry.person_id=payrate_derived.person_id
ORDER BY entry.person_id, entry.workedon

Now what @hybrid can do for you in SQLAlchemy, when run at the SQL expression level, is exactly just the "entry.hours * payrate_derived.hourly" part, that's it. All the JOIN and such there, you'd need to provide externally to the hybrid.

So we need to stick that big subquery into this:

class Time(...):
    @hybrid_property
    def cost(self):
        # ....

    @cost.expression
    def cost(cls):
        return cls.hours * <SOMETHING>.hourly

So let's figure out what <SOMETHING> is. Build up that SELECT as an object:

from sqlalchemy.orm import aliased, join, outerjoin
from sqlalchemy import and_, func

pa1 = aliased(Payrate)
pa2 = aliased(Payrate)
ending = select([pa1.payrate_id, func.min(pa2.starting).label('ending')]).\
            select_from(join(pa1, pa2, and_(pa1.person_id == pa2.person_id, pa2.starting > pa1.starting))).\
            group_by(pa1.payrate_id).alias()

payrate_derived = select([Payrate.person_id, Payrate.hourly, Payrate.starting, ending.c.ending]).\
    select_from(outerjoin(Payrate, ending, Payrate.payrate_id == ending.c.payrate_id)).alias()

The cost() hybrid, on the expression side, would need to refer to payrate_derived (we'll do the python side in a minute):

class Time(...):
    @hybrid_property
    def cost(self):
        # ....

    @cost.expression
    def cost(cls):
        return cls.hours * payrate_derived.c.hourly

Then in order to use our cost() hybrid, it would have to be in the context of a query that has that join. Note here we use Python's datetime.date.max to get that max date (handy!):

print session.query(Person.name, Time.workedon, Time.hours, Time.cost).\
                    select_from(Time).\
                    join(Time.person).\
                    join(payrate_derived,
                            and_(
                                payrate_derived.c.person_id == Time.person_id,
                                Time.workedon.between(
                                    payrate_derived.c.starting,
                                    func.coalesce(
                                        payrate_derived.c.ending,
                                        datetime.date.max
                                    )
                                )
                            )
                    ).\
                    all()

So that join is big, and klunky, and we will need to do it often, not to mention we're going to need to load that same collection in Python when we do our in-Python hybrid. We can map to it using relationship(), which means we have to set up custom join conditions, but also we need to actually map to that subquery, using a lesser-known technique called a non-primary mapper. A non-primary mapper gives you a way to map a class to some arbitrary table or SELECT construct just for the purposes of selecting rows. We normally never need to use this because Query already lets us query for arbitrary columns and subqueries, but to get it out of a relationship() it needs a mapping. The mapping needs a primary key to be defined, and the relationship also needs to know which side of the relationship is "foreign". This is the most advanced part here and in this case it works out like this:

from sqlalchemy.orm import mapper, relationship, foreign

payrate_derived_mapping = mapper(Payrate, payrate_derived, non_primary=True,
                                        primary_key=[
                                            payrate_derived.c.person_id,
                                            payrate_derived.c.starting
                                        ])
Time.payrate = relationship(
                    payrate_derived_mapping,
                    viewonly=True,
                    uselist=False,
                    primaryjoin=and_(
                            payrate_derived.c.person_id == foreign(Time.person_id),
                            Time.workedon.between(
                                payrate_derived.c.starting,
                                func.coalesce(
                                    payrate_derived.c.ending,
                                    datetime.date.max
                                )
                            )
                        )
                    )

So that's the last we'd have to see of that join. We can now do our query earlier as:

print session.query(Person.name, Time.workedon, Time.hours, Time.cost).\
                    select_from(Time).\
                    join(Time.person).\
                    join(Time.payrate).\
                    all()

and finally we can wire up our new payrate relationship into the Python-level hybrid as well:

class Time(Base):
    # ...

    @hybrid_property
    def cost(self):
        return self.hours * self.payrate.hourly

    @cost.expression
    def cost(cls):
        return cls.hours * payrate_derived.c.hourly

The solution we have here took a lot of effort, but at least the most complex part, that payrate mapping, is entirely in just one place and we never need to look at it again.

Here's a full working example:

from sqlalchemy import create_engine, Column, Integer, ForeignKey, Date, \
                    UniqueConstraint, select, func, and_, String
from sqlalchemy.orm import join, outerjoin, relationship, Session, \
                    aliased, mapper, foreign
from sqlalchemy.ext.declarative import declarative_base
import datetime
from sqlalchemy.ext.hybrid import hybrid_property


Base = declarative_base()

class Person(Base):
    __tablename__ = 'person'
    person_id = Column(Integer, primary_key=True)
    name = Column(String(30), unique=True)

class Payrate(Base):
    __tablename__ = 'payrate'
    payrate_id = Column(Integer, primary_key=True)
    person_id  = Column(Integer, ForeignKey('person.person_id'))
    hourly    = Column(Integer)
    starting  = Column(Date)

    person = relationship("Person")
    __tableargs__ =(UniqueConstraint('person_id', 'starting',
                                     name='uc_peron_starting'))

class Time(Base):
    __tablename__ = 'entry'
    entry_id  = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey('person.person_id'))
    workedon = Column(Date)
    hours    = Column(Integer)

    person = relationship("Person")

    @hybrid_property
    def cost(self):
        return self.hours * self.payrate.hourly

    @cost.expression
    def cost(cls):
        return cls.hours * payrate_derived.c.hourly

pa1 = aliased(Payrate)
pa2 = aliased(Payrate)
ending = select([pa1.payrate_id, func.min(pa2.starting).label('ending')]).\
            select_from(join(pa1, pa2, and_(
                                        pa1.person_id == pa2.person_id,
                                        pa2.starting > pa1.starting))).\
            group_by(pa1.payrate_id).alias()

payrate_derived = select([Payrate.person_id, Payrate.hourly, Payrate.starting, ending.c.ending]).\
    select_from(outerjoin(Payrate, ending, Payrate.payrate_id == ending.c.payrate_id)).alias()

payrate_derived_mapping = mapper(Payrate, payrate_derived, non_primary=True,
                                        primary_key=[
                                            payrate_derived.c.person_id,
                                            payrate_derived.c.starting
                                        ])
Time.payrate = relationship(
                    payrate_derived_mapping,
                    viewonly=True,
                    uselist=False,
                    primaryjoin=and_(
                            payrate_derived.c.person_id == foreign(Time.person_id),
                            Time.workedon.between(
                                payrate_derived.c.starting,
                                func.coalesce(
                                    payrate_derived.c.ending,
                                    datetime.date.max
                                )
                            )
                        )
                    )



e = create_engine("postgresql://scott:tiger@localhost/test", echo=False)
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

session = Session(e)
p1 = Person(name='p1')
session.add(p1)

session.add_all([
    Payrate(hourly=10, starting=datetime.date(2013, 5, 17), person=p1),
    Payrate(hourly=15, starting=datetime.date(2013, 5, 25), person=p1),
    Payrate(hourly=20, starting=datetime.date(2013, 6, 10), person=p1),
])

session.add_all([
    Time(person=p1, workedon=datetime.date(2013, 5, 19), hours=10),
    Time(person=p1, workedon=datetime.date(2013, 5, 27), hours=5),
    Time(person=p1, workedon=datetime.date(2013, 5, 30), hours=5),
    Time(person=p1, workedon=datetime.date(2013, 6, 18), hours=12),
])
session.commit()

print session.query(Person.name, Time.workedon, Time.hours, Time.cost).\
                    select_from(Time).\
                    join(Time.person).\
                    join(Time.payrate).\
                    all()

for time in session.query(Time):
    print time.person.name, time.workedon, time.hours, time.payrate.hourly, time.cost

Output (first line is the aggregate version, remainder is the per-object):

[(u'p1', datetime.date(2013, 5, 19), 10, 100), (u'p1', datetime.date(2013, 5, 27), 5, 75), (u'p1', datetime.date(2013, 5, 30), 5, 75), (u'p1', datetime.date(2013, 6, 18), 12, 240)]
p1 2013-05-19 10 10 100
p1 2013-05-27 5 15 75
p1 2013-05-30 5 15 75
p1 2013-06-18 12 20 240
Eric
  • 95,302
  • 53
  • 242
  • 374
zzzeek
  • 72,307
  • 23
  • 193
  • 185
  • 4
    if you put an end_date on your Payrate table half the code here goes away. – zzzeek Jun 27 '13 at 19:53
  • I was also looking at using `starting <= workedon order by starting DESC limit 1` – Frustrated Jun 28 '13 at 04:37
  • 1
    I much prefer your explanation as the generic walk-through solution on hiding complexity. Very well done. I just need to finish reading up on the different parts of SQLAlchemy you touch upon. – Frustrated Jun 28 '13 at 04:46
6

Many times the best advice I can give is to just do it different. A multi-table calculated column like this is what database views are for. Build a view based on the Time table (or whatever else you want) with your calculated column in it, build a model based on the view, and you're set. This will likely be less stressful on the database as well. This is also a good example of why limiting design to what can be accomplished through automated migrations is dangerous.

juanitogan
  • 1,698
  • 1
  • 22
  • 37
  • 3
    So obvious, yet I didn't even think about it. – EndermanAPM Nov 17 '20 at 19:09
  • Can I ask - why do you think automated migrations are relevant? You can create views with them; at least in SQLAlchemy you can. – Rob Grant Jun 20 '21 at 21:09
  • 1
    @RobGrant, Yeah, that is a loaded topic that I maybe should have left out. In general, in practice, I find that active record, for example, contributes heavily to limiting design to 3rd normal form. Yes, you can break beyond that in some active-record tools, but it takes effort few are willing to reach for. Worse, it lends to a mode of thinking that app and db objects should have a 1:1 relationship. That works too, but is usually far from ideal when building your most valuable asset that needs to be both a fortress and a hot rod. Yes, SQLAlchemy is nice and flexible, but still step carefully. – juanitogan Jun 21 '21 at 22:32
  • @RobGrant, I think I still missed your question. To be precise, migrations are relevant because (and certainly at the time of this answer) I find them to be terribly limiting (and typically limited to the active-record pattern). I don't recall if basic views and materialized views were supported at the time, but how you might want to wire a view up to the rest of the model certainly had limits. I haven't used SQLAlchemy on a new project since, but, in an older version, I find myself always manually adding lines to the migration code that it simply cannot generate on its own. – juanitogan Jun 21 '21 at 23:10
  • @juanitogan ah sure - I think when I saw "automatic migrations" I thought you meant "the act of automatically migrating schema and data", when I think you're more referring to "the automation of migration creation". For me, the latter is useful, but the former is vital, and so having some migrations that are just SQL is not a problem. – Rob Grant Jun 22 '21 at 09:14