Order rows in a many-to-many relationship by count, but fast

Question

I have a many-to-many relationship between the Image and Tag tables in my project:

tags2images = db.Table("tags2images",
    db.Column("tag_id", db.Integer, db.ForeignKey("tags.id", ondelete="CASCADE", onupdate="CASCADE"), primary_key=True),
    db.Column("image_id", db.Integer, db.ForeignKey("images.id", ondelete="CASCADE", onupdate="CASCADE"), primary_key=True)
)

class Image(db.Model):
    __tablename__ = "images"

    id = db.Column(db.Integer, primary_key=True, autoincrement=False)
    title = db.Column(db.String(1000), nullable=True)

    tags = db.relationship("Tag", secondary=tags2images, back_populates="images", passive_deletes=True)

class Tag(db.Model):
    __tablename__ = "tags"

    id = db.Column(db.Integer, primary_key=True, autoincrement=True)
    name = db.Column(db.String(250), nullable=False, unique=True)

    images = db.relationship(
        "Image",
        secondary=tags2images,
        back_populates="tags",
        passive_deletes=True
    )

and I'd like to grab a list of tags, ordered by how many times they're used in images. My images and tags tables contain ~200.000 and ~1.000.000 rows respectively, so there's a decent amount of data.

After a bit of messing around, I arrived at this monstrosity:

db.session.query(Tag, func.count(tags_assoc.c.tag_id).label("total"))\
        .join(tags_assoc)\
        .group_by(Tag)\
        .order_by(text("total DESC"))\
        .limit(20).all()

and while it does return a list of (Tag, count) tuples the way I want it to, it takes several seconds, which is not optimal.

I found this very helpful post (Counting relationships in SQLAlchemy) that helped me simplify the above to just

db.session.query(Tag.name, func.count(Tag.id))\
        .join(Tag.works)\
        .group_by(Tag.id)\
        .limit(20).all()

and while this is wicked fast compared to my first attempt, the output obviously isn't sorted anymore. How can I get SQLAlchemy to produce the desired result while keeping the query fast?

Since your `Tag` class has only one meaningful attribute, `Tag.name`, and it is defined as `unique=True` then you could just use that as the primary key and omit the `id` surrogate (autoincrement) key. That way your aggregation query on the association table would return the `Tag.name` PK directly, thus potentially avoiding an unnecessary join on the surrogate key. — Gord Thompson, Sep 12 '21 at 17:28

score 0 · Answer 1 · answered Sep 12 '21 at 18:59

This seems like something you probably need to use EXPLAIN for in psql. I added a combined index on both the tag_id and image_id via Index('idx_tags2images', 'tag_id', 'image_id'). I'm not sure what is better, individual indices or combined? But maybe see if using a limited subquery on just the association table before joining is faster.

from sqlalchemy import select
tags2images = Table("tags2images",
                    Base.metadata,
                    Column("id", Integer, primary_key=True),
                    Column("tag_id", Integer, ForeignKey("tags.id", ondelete="CASCADE", onupdate="CASCADE"), index=True),
                    Column("image_id", Integer, ForeignKey("images.id", ondelete="CASCADE", onupdate="CASCADE"), index=True),
                    Index('idx_tags2images', 'tag_id', 'image_id'),
)

class Image(Base):
    __tablename__ = "images"

    id = Column(Integer, primary_key=True)
    title = Column(String(1000), nullable=True)

    tags = relationship("Tag", secondary=tags2images, back_populates="images", passive_deletes=True)

class Tag(Base):
    __tablename__ = "tags"

    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String(250), nullable=False, unique=True)

    images = relationship(
        "Image",
        secondary=tags2images,
        back_populates="tags",
        passive_deletes=True
    )

with Session() as session:
    total = func.count(tags2images.c.image_id).label("total")
    # Count, group and order just the association table itself.
    sub = select(
        tags2images.c.tag_id,
        total
    ).group_by(
        tags2images.c.tag_id
    ).order_by(
        total.desc()
    ).limit(20).alias('sub')
    # Now bring in the Tag names with a join
    # we order again but this time only across 20 entries.
    # @NOTE: Subquery will not get tags with image_count == 0
    # since we use INNER join.
    q = session.query(
        Tag,
        sub.c.total
    ).join(
        sub,
        Tag.id == sub.c.tag_id
    ).order_by(sub.c.total.desc())
    for tag, image_count in q.all():
        print (tag.name, image_count)

Order rows in a many-to-many relationship by count, but fast

1 Answers1