
I'm trying to have my popular_query subquery remove duplicate Place.id rows, but it doesn't. The code is below. I tried using distinct(), but it doesn't respect the order_by clause.

SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)
popular_query = (db.session.query(Post, func.count(SimilarPost.id)).
         join(Place, Place.id == Post.place_id).
         join(PostOption, PostOption.post_id == Post.id).
         outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val).
         join(SimilarPost, SimilarPost.id == SimilarPostOption.post_id).
         filter(Place.id == Post.place_id).
         filter(self.radius_cond()).
         group_by(Post.id).
         group_by(Place.id).
         order_by(desc(func.count(SimilarPost.id))).
         order_by(desc(Post.timestamp))
         ).subquery().select()

all_posts = db.session.query(Post).select_from(filter.pick()).all()

I did a test printout with

print [x.place.name for x in all_posts]

[u'placeB', u'placeB', u'placeB', u'placeC', u'placeC', u'placeA']

How can I fix this?

Thanks!

nubela
  • Does it respect the ordering if you remove the `group_by(Place.id)` clause and add `distinct(Place.id)` instead? I would think that the `group_by` would be unnecessary if you use `distinct`. – Nicholas Sep 15 '12 at 18:09
  • Are `self.radius_cond()` and `filter.pick()` things you've defined elsewhere? I don't see anything actually making use of `popular_query`. – Mu Mind Sep 16 '12 at 20:36

2 Answers


This should get you what you want:

from sqlalchemy import func, sql
from sqlalchemy.orm import aliased

SimilarPost = aliased(Post)
SimilarPostOption = aliased(PostOption)

# Correlated scalar subquery: for the outer Post, count the posts at the
# same place that share an option value with it.
post_popularity = (db.session.query(func.count(SimilarPost.id))
        .select_from(PostOption)
        .filter(PostOption.post_id == Post.id)
        .correlate(Post)
        .outerjoin(SimilarPostOption, PostOption.val == SimilarPostOption.val)
        .join(SimilarPost, sql.and_(
                SimilarPost.id == SimilarPostOption.post_id,
                SimilarPost.place_id == Post.place_id)
        )
        .as_scalar())

# Correlated scalar subquery: the id of the most popular post for the
# outer Place.
popular_post_id = (db.session.query(Post.id)
        .filter(Post.place_id == Place.id)
        .correlate(Place)
        .order_by(post_popularity.desc())
        .limit(1)
        .as_scalar())

# One row per place: a Post is kept only if it is its place's most
# popular post, ordered by popularity and then recency.
deduped_posts = (db.session.query(Post, post_popularity)
        .join(Place)
        .filter(Post.id == popular_post_id)
        .order_by(post_popularity.desc(), Post.timestamp.desc())
        .all())
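
Each row comes back as a (Post, count) tuple, so to reproduce your test printout you could do something like:

for post, popularity in deduped_posts:
    print post.place.name, popularity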

I can't speak to the runtime performance with large data sets, and there may be a better solution, but that's what I managed to synthesize from quite a few sources (MySQL JOIN with LIMIT 1 on joined table, SQLAlchemy - subquery in a WHERE clause, SQLAlchemy Query documentation). The biggest complicating factor is that you apparently need to use `as_scalar` to nest the subqueries in the right places, and therefore can't return both the `Post` id and the count from the same subquery.

FWIW, this is kind of a behemoth, and I concur with user1675804 that SQLAlchemy code this deep is hard to grok and not very maintainable. You should take a hard look at any lower-tech solutions available, like adding columns to the db or doing more of the work in Python code.

Mu Mind

I don't want to sound like the bad guy here, but in my opinion your approach to the issue is far from optimal. If you're using PostgreSQL, you could simplify the whole thing using WITH ... A better approach, though, factoring in my assumption that these posts will be read much more often than updated, would be to add some columns to your tables that are updated by triggers on inserts/updates to the other tables. At least if performance is ever likely to become an issue, this is the solution I'd go with.
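
For illustration, here's a rough sketch of how that could look in SQLAlchemy, assuming a denormalized Post.popularity column kept up to date by such triggers (the column name is hypothetical, not from the question), and using a CTE with a window function to keep one row per place:

from sqlalchemy import func

# Rank each place's posts by the denormalized popularity column
# (assumed to be maintained elsewhere by insert/update triggers).
ranked = (db.session.query(
            Post.id.label("post_id"),
            func.row_number()
                .over(partition_by=Post.place_id,
                      order_by=Post.popularity.desc())
                .label("rnk"))
        .cte("ranked_posts"))

# Keep only the top-ranked post per place: one row per Place.id.
deduped_posts = (db.session.query(Post)
        .join(ranked, Post.id == ranked.c.post_id)
        .filter(ranked.c.rnk == 1)
        .order_by(Post.popularity.desc(), Post.timestamp.desc())
        .all())

On PostgreSQL this compiles to a WITH ... SELECT using ROW_NUMBER() OVER (PARTITION BY ...), which avoids the correlated subqueries entirely.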

I'm not very familiar with SQLAlchemy, so I can't write it in clear code for you, but the only other solution I can come up with uses at least one subquery per GROUP BY column to select the corresponding ORDER BY value, and that would add significantly to your already slow query.

xception