5

I have a very standard, basic social application -- with status updates (i.e., posts), and multiple comments per post.

Given the following simplified models, is it possible, using Django's ORM, to efficiently retrieve all posts and the latest two comments associated with each post, without performing N+1 queries? (That is, without performing a separate query to get the latest comments for each post on the page.)

class Post(models.Model):
    title = models.CharField(max_length=255)
    text = models.TextField()

class Comment(models.Model):
    text = models.TextField()
    post = models.ForeignKey(Post, related_name='comments')

    class Meta:
        ordering = ['-pk']

Post.objects.prefetch_related('comments').all() fetches all posts and comments, but I'd like to retrieve a limited number of comments per post only.

UPDATE:

I understand that, if this can be done at all using Django's ORM, it probably must be done with some version of prefetch_related. Multiple queries are totally okay, as long as I avoid making N+1 queries per page.

What is the typical/recommended way of handling this problem in Django?

UPDATE 2:

There seems to be no direct and easy way to do this efficiently with a simple query using the Django ORM. There are a number of helpful solutions/approaches/workarounds in the answers below, including:

  • Caching the latest comment IDs in the database
  • Performing a raw SQL query
  • Retrieving all comment IDs and doing the grouping and "joining" in python
  • Limiting your application to displaying the latest comment only

I didn't know which one to mark as correct because I haven't gotten a chance to experiment with all of these methods yet -- but I awarded the bounty to hynekcer for presenting a number of options.

UPDATE 3:

I ended up using @user1583799's solution.

tino
  • 4,780
  • 5
  • 24
  • 30
  • I'm not sure that `.select_related('comments')` fetches comments. `.select_related` can fetch ForeignKey, OneToOne relations and reverse-OneToOne – Igor Oct 14 '14 at 12:51
  • @Igor, huh, I didn't realize that was the case. I guess the docs on [prefetch_related](https://docs.djangoproject.com/en/1.6/ref/models/querysets/#prefetch-related) imply this. Thanks for the heads up. – tino Oct 14 '14 at 16:50
  • What is the problem with fetching all related comments? You can later use only first two for each post. `posts[0].comments.all()` won't perform an additional query. Is the problem the fact there are too many related queries to prefetch them all? – Krzysztof Szularz Oct 17 '14 at 13:20
  • @KrzysztofSzularz Thanks for the reply. There are 20 posts per page with hundreds of comments each. It seems like I'm stuck either performing 31 queries to get the 60 comments that will be displayed. Or 2 queries to prefetch and load into memory thousands of comments, 99% of which will not be shown. – tino Oct 17 '14 at 14:48

3 Answers3

2

This solution is optimized for memory requirements, as you expect it important. It needs three queries. The first query asks for posts, the second query only for tuples (id, post_id). The third for details of filtered latest comments.

from itertools import groupby, islice
posts = Post.objects.filter(...some your flter...)
# sorted by date or by id
all_comments = (Comment.objects.filter(post__in=posts).values('post_id')
        .order_by('post_id', '-pk'))
last_comments = []
# the queryset is evaluated now. Only about 100 itens chunks are in memory at
# once during iterations.
for post_id, related_comments in groupby(all_comments(), lambda x: x.post_id):
        last_comments.extend(islice(related_comments, 2))
results = {}
for comment in Comment.objects.filter(pk__in=last_comments):
    results.setdefault(comment.post_id, []).append(comment)
# output
for post in posts:
    print post.title, [x.comment for x in results[post.id]]

But I think it will be faster for many database backends to combine the second and the third query into one and so to ask immediately for all fields of comments. Unuseful comments will be forgotten immediately.

The fastest solution would be with nested queries. The algorithm is like the one above, but everything is realized by raw SQL. It is limited only to some backends like PostgresQL.


EDIT
I agree that is not useful for you

... prefetch loads into memory thousands of comments, 99% of which will not be shown.

and therefore I wrote that relatively complicated solution that 99% of them will be read continuously without loading into memory.


EDIT

  • All examples are for the condition that you wand post_id in [1, 3, 5] (enything selected earlier by categories etc.)
  • In all cases create the index for Comments on fields ['post', 'pk']

A) Nested query for PostgresQL

SELECT post_id, id, text FROM 
  (SELECT post_id, id, text, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
   FROM app_comment WHERE post_id in (1, 3, 5)) sub
WHERE rank <= 2
ORDER BY post_id, id

Or explicitely require with less memory if we don't believe the optimizer. It should read data only from index in two inner selects, which is much less data than from the table.:

SELECT post_id, id, text FROM app_comment WHERE id IN
  (SELECT id FROM
     (SELECT id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
      FROM app_comment WHERE post_id in (1, 3, 5)) sub
   WHERE rank <= 2)
ORDER BY post_id, id

B) With a cached ID of the oldest displayed comment

  • Add field "oldest_displayed" to Post

    class Post(models.Model):
        oldest_displayed = models.IntegerField()

  • Filter comments for pk if interesting posts (that you have selected earlier by categories etc.)

Filter

from django.db.models import F
qs = Comment.objects.filter(
       post__pk__in=[1, 3, 5],
       post__oldest_displayed__lte=F('pk')
       ).order_by('post_id', 'pk')
pprint.pprint([(x.post_id, x.pk) for x in qs])

Hmm, very nice ... and how it is compiled by Django?

>>> print(qs.query.get_compiler('default').as_sql()[0])      # added white space
SELECT "app_comment"."id", "app_comment"."text", "app_comment"."post_id"
FROM "app_comment"
INNER JOIN "app_post" ON ( "app_comment"."post_id" = "app_post"."id" )
WHERE ("app_comment"."post_id" IN (%s, %s, %s)
      AND "app_post"."oldest_displayed" <= ("app_comment"."id"))
ORDER BY app_comment"."post_id" ASC, "app_comment"."id" ASC

Prepare all "oldest_displayed" by one nested SQL initially (and set zero for posts with less than two comments):

UPDATE app_post SET oldest_displayed = 0

UPDATE app_post SET oldest_displayed = qq.id FROM
  (SELECT post_id, id FROM
     (SELECT post_id, id, rank() OVER (PARTITION BY post_id ORDER BY id DESC)
      FROM app_comment ) sub
   WHERE rank = 2) qq
WHERE qq.post_id = app_post.id;
hynekcer
  • 14,942
  • 6
  • 61
  • 99
  • Thanks, hynekcer. I'm not sure, but iterating through all comments might not provide the benefits you suggest, at least according to [this question](http://stackoverflow.com/questions/4222176/why-is-iterating-through-a-large-django-queryset-consuming-massive-amounts-of-me). – tino Oct 20 '14 at 10:44
  • @tino: No. Compared to prefetch, it does read less data (id of related comments, without texts) and saves much less data (only id of two latest comments). Than it reads only object you want to display. I expect that it is faster than other solutions. I it is still not enough, I can improve speed by caching of one numeric variable - the primary key of the oldest of two comments that should be displayed. – hynekcer Oct 20 '14 at 13:40
  • Ah, I see the memory advantage now, thanks! I'd have to profile this to see if it helps, though it may make more sense overall to cache the last two comment id's since there doesn't seem to be an easy way to do this on the retrieval side. You mentioned that the fastest solution would be nested queries... how would you do this in Django with a Postgres backend? – tino Oct 21 '14 at 07:00
  • Yes, I added this all to the answer now. – hynekcer Oct 21 '14 at 22:12
  • Thanks, hynekcer! I'll give some of these a try and figure out what works best. Appreciate it. – tino Oct 22 '14 at 01:00
2

If you're using Django 1.7 the new Prefetch objects—allowing you to customize the prefetch queryset—could prove helpful.

Unfortunately I can't think of a simple way to do exactly what you're asking. If you're on PostgreSQL and are willing to get just the latest comment for each post, the following should work in two queries:

comments = Comment.objects.order_by('post_id', '-id').distinct('post_id')
posts = Post.objects.prefetch_related(Prefetch('comments',
                                               queryset=comments,
                                               to_attr='latest_comments'))

for post in posts:
    latest_comment = post.latest_comments[0] if post.latest_comments else None

Another variation: if your comments had a timestamp and you wanted to limit the comments to the most recent ones by date, that would look something like:

comments = Comment.objects.filter(timestamp__gt=one_day_ago)

...and then as above. Of course, you could still post-process the resulting list to limit the display to a maximum of two comments.

Kevin Christopher Henry
  • 46,175
  • 7
  • 116
  • 102
  • Thanks a lot, Kevin. I can't assume that comments will be within a specific time range, but perhaps I'll settle for just the single latest comment if I can't figure out a way to do this. (Yeah, the new Prefetch object is cool -- right before asking the question I upgraded to 1.7, thinking it might be able to do this.) – tino Oct 20 '14 at 11:03
1

prefetch_related('comments') will fetch all comments of the posts.

I had the same problem, and the database is Postgresql. I found a way:

Add a extra fieldrelated_replies. Note the FieldType is ArrayField, which support in django1.8dev. I copy the code to my project(the version of django is 1.7), just change 2 lines, it works.(or use djorm-pg-array )

class Post(models.Model): related_replies = ArrayField(models.IntegerField(), size=10, null=True)

And use two queries:

posts = model.Post.object.filter()

related_replies_id = chain(*[p.related_replies for p in posts])
related_replies = models.Comment.objects.filter(
    id__in=related_replies_id).select_related('created_by')[::1]  # cache queryset

for p in posts:
    p.get_related_replies = [r for r in related_replies if r.post_id == p.id]

When new comment comes, update related_replies.

  • Thanks! I might end up tracking the latest two comments in the database exactly as you suggest if I can't figure out a good way to do this upon retrieval alone. I also wasn't aware of ArrayField, so appreciate the info. – tino Oct 20 '14 at 11:16