
I have 2 models: Post and PostLike. I want to return a query of Posts filtered by their PostLike created_at field, as well as return the total count of posts in that query efficiently (without using .count()).

Simplified models:

class Post(db.Model):
    ...
    post_likes = db.relationship('PostLike', back_populates="post")

class PostLike(db.Model):
    ...
    created_at = db.Column(db.Float)

    post_id = db.Column(UUID(as_uuid=True), db.ForeignKey("post.id"), index=True)
    post = db.relationship("Post", back_populates="post_likes", foreign_keys=[post_id])

Here are the queries I'm trying to run:

# Get all posts
posts = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).all()

# Get total # of posts
post_count = Post.query.join(PostLike, Post.post_likes).order_by(PostLike.created_at.desc()).count()

There are 3 problems with those queries.

  1. I'm not sure those queries are the best for my use case. Are they?

  2. The count query returns the wrong number: it is higher than the number of results returned by the .all() query. Why?

  3. Calling .count() directly is not performant. How do I implement an efficient query that also retrieves the count? Something like .statement.with_only_columns([func.count()])?

I'm using Postgres, and I'm expecting up to millions of rows to count. How do I achieve this efficiently?
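
The duplicate-count problem and the "count alongside the rows" question can be sketched in plain SQL. Below is a minimal illustration using Python's stdlib sqlite3 module (the table names mirror the simplified models; this is not the Flask-SQLAlchemy setup itself, and these queries behave the same way on PostgreSQL):

```python
# Sketch: why a join inflates COUNT(*), and how to get the real total.
# Table names are assumptions mirroring the simplified models above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY);
    CREATE TABLE post_like (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES post(id),
        created_at REAL
    );
    INSERT INTO post (id) VALUES (1), (2);
    -- post 1 has two likes, so a plain join yields it twice
    INSERT INTO post_like (post_id, created_at)
        VALUES (1, 1.0), (1, 2.0), (2, 3.0);
""")

# COUNT(*) over the join counts joined rows, not posts:
joined = conn.execute(
    "SELECT COUNT(*) FROM post"
    " JOIN post_like ON post_like.post_id = post.id"
).fetchone()[0]

# COUNT(DISTINCT post.id) counts each post once:
distinct = conn.execute(
    "SELECT COUNT(DISTINCT post.id) FROM post"
    " JOIN post_like ON post_like.post_id = post.id"
).fetchone()[0]

print(joined, distinct)  # 3 2

# A window function attaches the total to every row, so the rows and
# the count come back in one round trip (SQLite >= 3.25; PostgreSQL
# supports the same syntax):
rows = conn.execute("""
    SELECT post.id,
           MAX(post_like.created_at) AS last_like,
           COUNT(*) OVER () AS total_posts
    FROM post JOIN post_like ON post_like.post_id = post.id
    GROUP BY post.id
    ORDER BY last_like DESC
""").fetchall()

print(rows)  # [(2, 3.0, 2), (1, 2.0, 2)]
```

In SQLAlchemy terms the window-function variant can be expressed with func.count().over(), though the exact query depends on the pagination scheme.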

Darius Mandres
  • After you've called `.all()` can't you just get the count with `len(posts)`? – Gord Thompson Feb 06 '21 at 16:19
  • I'm paginating the results, and it needs to work efficiently with very large amounts of rows. – Darius Mandres Feb 06 '21 at 16:50
  • If you add `echo=True` to your `create_engine()` call and look at the SQL statements that are being generated do they look inefficient to you? – Gord Thompson Feb 06 '21 at 17:14
  • BTW, what leads you to the conclusion that `.count()` is inefficient? – Gord Thompson Feb 06 '21 at 18:09
  • That's what I read online. Here are examples: https://stackoverflow.com/a/47801739/3590686 and https://stackoverflow.com/a/53733077/3590686 – Darius Mandres Feb 06 '21 at 18:14
  • Hmm, now that you mention it I remember seeing something about MySQL and subqueries. However, I just did a test with PostgreSQL on a remote table with a million rows and whether I use `n = session.query(Thing).count()` or `n = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()` I get the result back in 0.2 seconds. Certainly SQLAlchemy is not "fetch[ing] every row in the table". – Gord Thompson Feb 06 '21 at 21:59

1 Answer


re: efficiency of .count()

The comments in other answers (linked in a comment to this question) appear to be outdated for current versions of MySQL and PostgreSQL. Testing against a remote table containing a million rows showed that whether I used

n = session.query(Thing).count()

which renders SELECT COUNT(*) FROM (SELECT … FROM table_name), or I used

n = session.query(sa.func.count(sa.text("*"))).select_from(Thing).scalar()

which renders SELECT COUNT(*) FROM table_name, the row count was returned in the same amount of time (0.2 seconds in my case). This was tested using SQLAlchemy 1.4.0b2 against both MySQL version 8.0.21 and PostgreSQL version 12.3.
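
The two SQL shapes being timed can be reproduced outside SQLAlchemy. Here is a hedged sketch with the stdlib sqlite3 module and a throwaway thing table (an assumed name standing in for Thing; the point is only that both statements return the same number, which is why modern engines can run them in comparable time):

```python
# Sketch of the two COUNT shapes the timings above compare.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE thing (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO thing (id) VALUES (?)",
    [(i,) for i in range(1000)],
)

# session.query(Thing).count() wraps the query in a subselect:
wrapped = conn.execute(
    "SELECT COUNT(*) FROM (SELECT id FROM thing)"
).fetchone()[0]

# func.count(...).select_from(Thing).scalar() counts the table directly:
direct = conn.execute("SELECT COUNT(*) FROM thing").fetchone()[0]

print(wrapped, direct)  # 1000 1000
```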

Gord Thompson