2

I currently have 3 tables roughly described as the following SQLAlchemy mapping:

class Task(BASE):
    __tablename__ = 'tasks'
    id = Column(Integer, primary_key=True)

    service_id = Column(Integer, ForeignKey('services.id'))
    service = relationship('Service', back_populates="tasks")

    updates = relationship("TaskUpdate")


class TaskUpdate(BASE):
    __tablename__ = 'task_updates'

    id = Column(Integer, primary_key=True)
    external_status = Column(String(32))
    external_updated_at = Column(DateTime(timezone=True))

    task_id = Column(Integer, ForeignKey('tasks.id'))
    task = relationship('Task', back_populates="updates")


class Service(BASE):
    __tablename__ = 'services'

    id = Column(Integer, primary_key=True)

    client_id = Column(Integer, ForeignKey('clients.id'))
    client = relationship('Client', back_populates='services')

So I have one-to-many relationship from Task to TaskUpdates and a many-to-one from Task to Service.

I'm trying to create a query to get all Tasks where their latest TaskUpdate (by timestamp) has an external_status that is "New", or "Open."

Here's what I got:

sub = SESSION.query(
        TaskUpdate.task_id,
        TaskUpdate.external_status.label('last_status'),
        func.max(TaskUpdate.external_updated_at).label('last_update')
        ).group_by(TaskUpdate.task_id
        ).subquery()
tasks = SESSION.query(Task
        ).join(Service
        ).filter(Service.client_id == client_id
        ).join((sub, sub.c.task_id == Task.id)
        ).filter(sub.c.last_status.in_(['New', 'Open']))

When I run this, I get this error:

ProgrammingError: (psycopg2.ProgrammingError) column "task_updates.external_status" must appear in the GROUP BY clause or be used in an aggregate function

I'd appreciate any help you can give. This is important.

Update 1 (this is the SQL that ended up working (as far as I can tell, I can't test the frontend until I get this working in SQLAlchemy though:

SELECT t.* FROM ( 
  SELECT DISTINCT ON (task_id) task_id, external_status 
  FROM task_updates 
  ORDER BY task_id, external_updated_at DESC NULLS LAST) tu 
JOIN tasks t ON t.id = tu.task_id 
JOIN services s ON s.id = t.service_id 
WHERE s.client_id = '" + str(client_id) + "' 
AND tu.external_status IN ('New', 'Open');

Here's my conversion attempt, still not working:

sub = SESSION.query(TaskUpdate).distinct(TaskUpdate.task_id).order_by(TaskUpdate.task_id.desc().nullslast(), TaskUpdate.external_updated_at.desc().nullslast()).subquery()
tasks = SESSION.query(Task).join(Service).join(sub.c.task_id==Task.id).filter(TaskUpdate.external_status.in_(['New', 'Open']))

Update 2: The query I have below works, but when I do .count() it returns the total number of TaskUpdates, not tasks and I suspect the query will need to be redone a different way, unless someone knows a way to handle this?

Phil Salesses
  • 862
  • 1
  • 9
  • 31

3 Answers3

1

On way to do this:

SELECT t.*
FROM  (
   SELECT DISTINCT ON (task_id)
          task_id, external_status
   FROM   task_updates
   ORDER  BY task_id, external_updated_at DESC NULLS LAST
   ) tu
JOIN   tasks t ON t.id = tu.task_id
WHERE  tu.external_status IN ('New', 'Open');

First get the last row per task, then only pick tasks wit the right external_status.

Detailed explanation for DISTINCT ON:

If you have many row per task, there are faster query techniques:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • SELECT t.* FROM (SELECT DISTINCT ON (task_id) task_id, external_status FROM task_updates ORDER BY task_id, external_updated_at DESC NULLS LAST) tu JOIN tasks t ON t.id = tu.task_id JOIN services s ON s.id = t.service_id WHERE s.client_id = '1' AND tu.external_status IN ('New', 'Open'); That was the final query that worked, thanks. Now I need to convert it to SQLAlchemy. – Phil Salesses Apr 02 '17 at 02:12
  • If you do a count on this query, it returns the total number of TaskUpdates. This isn't the ideal behavior. Is there a cleaner one? – Phil Salesses Apr 10 '17 at 05:12
  • @PhilSalesses: If you do a count on this query, it *won't* return the total number of TaskUpdates. You get the count of rows in `tasks` that have at least one related row in `task_updates`. There must be a misunderstanding somewhere. – Erwin Brandstetter Apr 12 '17 at 13:37
0

I'm giving credit to Erwin since he put me onto the the right trail, but this is what I ended up using in the end. Works well enough. Will optimize later once I actually have an engineer or several working with me. :)

Thanks!

sub = SESSION.query(TaskUpdate.task_id, TaskUpdate.external_status).distinct(TaskUpdate.task_id).order_by(TaskUpdate.task_id.desc().nullslast(), TaskUpdate.external_updated_at.desc().nullslast()).subquery()
tasks = SESSION.query(Task).join(Service).join((sub, sub.c.task_id==Task.id)).filter(sub.c.external_status.in_(['New', 'Open', 'Pending']))

Maybe I converted this incorrectly, but when I do a count, it gives me the number of TaskUpdates, not Tasks. This causes problems in my app.

Phil Salesses
  • 862
  • 1
  • 9
  • 31
0

Here's one way to get the wanted result:

In SQL (tested):

SELECT a.task_id, a.external_status, a.external_updated_at
FROM ( 
  SELECT task_id, max(external_updated_at) AS last_updated_at
  FROM task_updates 
  GROUP BY task_id
) b 
JOIN task_updates a ON a.task_id = b.task_id
WHERE
  a.external_updated_at = b.last_updated_at AND
  a.external_status IN ('New', 'Open')
ORDER BY
  a.task_id;

In Python/SQLAlchemy (haven't tested, don't have SQLAlchemy handy at the moment):

subq = session.query(
    TaskUpdate.task_id, func.max(TaskUpdate.external_updated_at).label('last_updated_at')
  ).group_by(
    TaskUpdate.task_id
  ).subquery()

q = session.query(
    TaskUpdate.task_id, TaskUpdate.external_status, TaskUpdate.external_updated_at
  ).join(
    TaskUpdate.task_id == subq.c.task_id)
  ).filter(
    TaskUpdate.external_updated_at == sub.c.last_updated_at,
    TaskUpdate.external_status.in_(['New', 'Open'])
  ).order_by(
    TaskUpdate.task_id
  )
Leo C
  • 22,006
  • 3
  • 26
  • 39