I'm currently building a blog-type app where many users can make many posts. To do this I have built the following models in Flask:

from datetime import datetime


class User(db.Model):
    user_id = db.Column(db.Integer, primary_key=True)
    firstname = db.Column(db.String(20), nullable=False, index=True)
    lastname = db.Column(db.String(20), nullable=False, index=True)
    email = db.Column(db.String(20), unique=True)

    # one side of the many-to-many: a User has many Contributions (association objects)
    contributions = db.relationship("Contributions", backref="user", lazy=True)


class Contributions(db.Model):
    contribution_id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(
        db.Integer, db.ForeignKey("user.user_id"), nullable=False
    )
    article_id = db.Column(
        db.Integer, db.ForeignKey("article.article_id"), nullable=False
    )
    author_number = db.Column(db.Integer)
    claimed = db.Column(db.Boolean, default=False)


class Article(db.Model):
    title = db.Column(db.String(20), nullable=False)
    article_id = db.Column(db.Integer, nullable=False, primary_key=True)
    pubdate = db.Column(
        db.DateTime, nullable=False, default=datetime.utcnow
    )

    contributions = db.relationship(
        "Contributions", backref="article", lazy=True
    )

    def __repr__(self):
        return f"Article title: {self.title}"

Each contribution represents one user having contributed to one article (a contribution carries more information than just the user/article link).
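
For context, a minimal sketch of how one contribution record gets created (user and article here are hypothetical, already-loaded instances; db.session is the Flask-SQLAlchemy session):

# user and article are assumed to be existing, already-loaded instances
contribution = Contributions(user=user, article=article, author_number=1)
db.session.add(contribution)
db.session.commit()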

Now, for each user I want to display the articles they have contributed to and all other users who have contributed to that article. e.g.:

authors = ""
for article in articles:
    for contribution in article.contributions:
        if authors:
            authors += ", "
        authors += contribution.user.firstname + " " + contribution.user.lastname

However, this loop was taking a really long time because my relationships use lazy=True, so each attribute access issues its own query. I therefore attempted a subqueryload:

contributions = (
    User.query.options(
        subqueryload("contributions.article.contributions")
    )
    .filter_by(
        lastname=current_user.lastname,
        firstname=current_user.firstname,
        email=None,
    )
    .first()
    .contributions
)

But for each User.query, I can only subqueryload either contributions.article or contributions.article.contributions. Whichever one I choose, loading the other still takes an extremely long time. Does anyone have any suggestions for how to improve speed?


1 Answer

You've stated:

...for each user I want to display the articles they have contributed to and all other users who have contributed to that article...

This query:

contributions = (
    User.query.options(
        subqueryload("contributions.article.contributions")
    )
    .filter_by(
        lastname=current_user.lastname,
        firstname=current_user.firstname,
        email=None,
    )
    .first()
    .contributions
)

... isn't a particularly efficient way to achieve your stated objective. The main issue is that you are originally querying for a User that you already have through the current_user proxy.
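
As a quick aside (assuming Flask-Login is what provides current_user), the logged-in user is already a loaded User instance behind a proxy, so there is no need to query for it by name and email:

from flask_login import current_user

# current_user wraps the already-loaded User row; unwrap the proxy if you
# need the plain instance, e.g. to pass it around or compare identities
user = current_user._get_current_object()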

...for each user I want to display the articles they have contributed to...

Then you should query for the user's Article objects:

articles = (
    Article.query.join(Contributions)
    .filter(Contributions.user == current_user)
    .all()
)

...and all other users who have contributed to that article...

The problem is that when we access the contributions attribute of each article, a query is still issued to fetch all of that article's contributors, and depending on the number of articles this can mean a lot of extra queries (the classic N+1 pattern). A sketch of the access pattern follows.
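
This is roughly the loop shape that triggers one extra SELECT per article when contributions is lazy-loaded (print is just a stand-in for whatever the view does with the data):

for article in articles:
    # each article.contributions access below lazy-loads that article's
    # contributions with its own SELECT: N articles means N extra queries
    for contribution in article.contributions:
        print(contribution.user_id, contribution.author_number)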

Here you can see the log of queries. The first query finds all articles for current_user. I've marked where I've accessed the contributions attribute of the first article returned from the query so that you can see the query issued to support that attribute access:

2019-08-27 16:00:21,317 INFO sqlalchemy.engine.base.Engine 
SELECT article.title AS article_title, 
       article.article_id AS article_article_id, 
       article.pubdate AS article_pubdate
FROM article INNER JOIN contributions 
ON article.article_id = contributions.article_id
WHERE %(param_1)s = contributions.user_id
2019-08-27 16:00:21,318 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
************************ access article contributions here************************
2019-08-27 16:00:21,321 INFO sqlalchemy.engine.base.Engine 
SELECT contributions.contribution_id AS contributions_contribution_id, 
       contributions.user_id AS contributions_user_id, 
       contributions.article_id AS contributions_article_id, 
       contributions.author_number AS contributions_author_number, 
       contributions.claimed AS contributions_claimed
FROM contributions
WHERE %(param_1)s = contributions.article_id
2019-08-27 16:00:21,332 INFO sqlalchemy.engine.base.Engine {'param_1': 1}

So to avoid that, let's ask for the contributions of each Article to be eagerly loaded:

articles = (
    Article.query.join(Contributions)
    .filter(Contributions.user == current_user)
    .options(subqueryload('contributions'))
    .all()
)

Here are the logs for the same operations, but after specifying the eager load:

2019-08-27 16:00:21,317 INFO sqlalchemy.engine.base.Engine 
SELECT article.title AS article_title, 
       article.article_id AS article_article_id, 
       article.pubdate AS article_pubdate
FROM article INNER JOIN contributions 
ON article.article_id = contributions.article_id
WHERE %(param_1)s = contributions.user_id
2019-08-27 16:00:21,318 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
2019-08-27 16:27:00,874 INFO sqlalchemy.engine.base.Engine 
SELECT contributions.contribution_id AS contributions_contribution_id, 
       contributions.user_id AS contributions_user_id, 
       contributions.article_id AS contributions_article_id, 
       contributions.author_number AS contributions_author_number, 
       contributions.claimed AS contributions_claimed, anon_1.article_article_id AS anon_1_article_article_id
FROM (SELECT article.article_id AS article_article_id
      FROM article INNER JOIN contributions 
      ON article.article_id = contributions.article_id
      WHERE %(param_1)s = contributions.user_id) 
AS anon_1 INNER JOIN contributions 
ON anon_1.article_article_id = contributions.article_id 
ORDER BY anon_1.article_article_id
2019-08-27 16:27:00,875 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
************************ access article contributions here************************

Notice there is now no query immediately following the contributions attribute access.

So the next step is to ensure that the user associated with each contribution to each article is eagerly loaded. If an article's only contributor is current_user, that user should already be present in the session's identity map and not expired (unless you've previously issued a commit), so no query would be needed to load them. If an article has more than one contributor, however, accessing any contributor other than current_user will trigger a query. So whether you need this extra eager load depends on whether articles generally have a single author or several. Assuming you do need it...
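
As a rough illustration of the identity-map behaviour (the primary key value here is made up):

u1 = User.query.get(1)  # first call emits a SELECT
u2 = User.query.get(1)  # same primary key: served from the identity map, no SQL
assert u1 is u2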

From the docs:

The loader options can also be “chained” using method chaining to specify how loading should occur further levels deep:

And that is what we are trying to accomplish here, an eager load that traverses a couple of levels of the relationship hierarchy, so we chain the eager loading options.

articles = (
    Article.query.join(Contributions)
    .filter(Contributions.user == current_user)
    .options(
        subqueryload("contributions")
        .joinedload("user", innerjoin=True)
    )
    .all()
)

That issues these queries to the db layer:

2019-08-27 16:00:21,317 INFO sqlalchemy.engine.base.Engine 
SELECT article.title AS article_title, 
       article.article_id AS article_article_id, 
       article.pubdate AS article_pubdate
FROM article INNER JOIN contributions 
ON article.article_id = contributions.article_id
WHERE %(param_1)s = contributions.user_id
2019-08-27 16:00:21,318 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
2019-08-27 17:01:14,144 INFO sqlalchemy.engine.base.Engine 
SELECT contributions.contribution_id AS contributions_contribution_id, 
       contributions.user_id AS contributions_user_id, 
       contributions.article_id AS contributions_article_id, 
       contributions.author_number AS contributions_author_number, 
       contributions.claimed AS contributions_claimed, 
       anon_1.article_article_id AS anon_1_article_article_id, 
       user_1.user_id AS user_1_user_id, 
       user_1.firstname AS user_1_firstname, 
       user_1.lastname AS user_1_lastname, 
       user_1.email AS user_1_email
FROM (SELECT article.article_id AS article_article_id
      FROM article INNER JOIN contributions 
      ON article.article_id = contributions.article_id
      WHERE %(param_1)s = contributions.user_id) 
AS anon_1 INNER JOIN contributions 
ON anon_1.article_article_id = contributions.article_id INNER JOIN user AS user_1 ON user_1.user_id = contributions.user_id 
ORDER BY anon_1.article_article_id

And a for loop such as this:

for article in articles:
    for contribution in article.contributions:
        print(contribution.user)

...emits no further queries.
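
Tying this back to your original goal, you can now build the author string for each article entirely from the already-loaded objects, with no additional SQL (a sketch; the join formatting is just one option):

for article in articles:
    authors = ", ".join(
        contribution.user.firstname + " " + contribution.user.lastname
        for contribution in article.contributions
    )
    print(article, authors)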

  • Great! Thank you so much. I did as you showed above except I joined not just Contributions, but also User, and filtered by User.user_id == current_user.user_id (I was getting an error when I did just as you showed above). However, this query alone still takes upwards of 5 seconds just to eager load contributions, users. Do you have any recommendation for improving speed? – swedishfished Aug 27 '19 at 17:28
  • How many contributions does a user have, on average? And for each article, how many contributors? Perhaps you'd be better off pre-generating the strings that list the authors of an article when articles are first added and store the author string with the article. – SuperShoot Aug 27 '19 at 22:54
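
A minimal sketch of that pre-generation idea (the authors column and the users list are hypothetical, not part of the models above):

# Hypothetical extra column on Article to store the pre-built author string:
#     authors = db.Column(db.String(200))
#
# Build the string once, when the article and its contributions are created;
# users is assumed to be the list of contributing User instances.
article = Article(title="Example title")
article.authors = ", ".join(u.firstname + " " + u.lastname for u in users)
for number, user in enumerate(users, start=1):
    db.session.add(Contributions(user=user, article=article, author_number=number))
db.session.add(article)
db.session.commit()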