3

I've got the following models in my app. The Addition model is used to govern the many-to-many relationship between the Book model and the Collection model, since I need to include extra fields on the intermediate model.

from django.db import models
from django.contrib.auth.models import User

class Book(models.Model):
    name = models.CharField(max_length=200)
    picture = models.ImageField(upload_to='img', max_length=1000)
    price = models.DecimalField(max_digits=8, decimal_places=2)

class Collection(models.Model):
    user = models.ForeignKey(User)
    name = models.CharField(max_length=100)
    books = models.ManyToManyField(Book, through='Addition')
    subscribers = models.ManyToManyField(User, related_name='collection_subscriptions', blank=True, null=True)

class Addition(models.Model):
    user = models.ForeignKey(User)
    book = models.ForeignKey(Book)
    collection = models.ForeignKey(Collection)
    created = models.DateTimeField(auto_now=False, auto_now_add=True)
    updated = models.DateTimeField(auto_now=True, auto_now_add=True)

In my app users can add books to collections that they create (for example fiction, history, etc.). Other users can then follow those collections that they like.

When a user logs into the site, I'd like to display all of the books that have been recently added to the collections that they follow. With each book, I'd also like to display the name of the person who added it, and the name of the collection it's in.

I can get all of the additions as follows...

additions = Addition.objects.filter(collection__subscribers=user).select_related()

... but this results in duplicate books being retrieved and displayed to the user, often side by side.

Is there a way to retrieve a distinct list of books that are in collections the user is following?

I'm using Django 1.3 + MySQL.

Thanks.

UPDATE

I should add that in general I'm not looking for any 'loop through the results and de-duplicate that way' solutions, for a couple of reasons.

There are likely to be tens or even hundreds of thousands of additions (I am also displaying this information on pages that list all new additions added by users), and response time is extremely important.

This solution may become more practical when limiting the initial result set, but it creates problems with pagination, which is also required. Namely, how do you paginate the entire result set while also de-duplicating only a small portion of that set? I'm open to any ideas here that may solve this problem.

UPDATE

I should also mention that if the same book gets added by multiple users, I actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine.

rolling stone
  • Did you try adding `.distinct()` on the end of your query set? – Cloud Artisans Nov 22 '11 at 21:10
  • @gorus that will just give me a distinct set of Addition objects. What I need is a set of Addition objects that have a distinct set of books, something along the lines of `Addition.objects.all().distinct('book')` – rolling stone Nov 23 '11 at 00:54
  • Book.objects.filter(addition__collection__subscribers=user).distinct() – armonge Nov 29 '11 at 04:18
  • @armonge but then I couldn't display the relevant addition and collection info for each book instance without an additional query from the template, leading to hundreds of queries per page. – rolling stone Nov 29 '11 at 07:56
  • I guess you should also clarify: If the same book gets added by multiple users which addition you would prefer... – Bernhard Vallant Nov 30 '11 at 14:44
  • thanks @lazerscience, great question. i actually don't have a preference for which addition gets used, either the original or the most recent addition would work fine. – rolling stone Nov 30 '11 at 19:19
  • I think if you're expecting a _really_ big load you should probably consider not querying the database for every user if there are any new additions, but rather have some queue for every user where notifications get _pushed_ into... – Bernhard Vallant Dec 01 '11 at 11:58

4 Answers

0

How about the following - it's not a pure SQL solution, and it'll cost you an extra database query and some loop time, but it should still perform ok, and it'll give you a lot more control over which additions take precedence over others:

def filter_additions(additions):
    # Use a ValuesQuerySet for performance
    additions_values = additions.values()

    # The following code just eliminates duplicates. You could do 
    # something much more powerful/interesting here if you like,
    # e.g. give preference to additions by a user's friends

    book_pk_registry = {}
    excluded_addition_pks = []

    for addition in additions_values:
        addition_pk = addition['id']
        book_pk = addition['book_id']
        if book_pk not in book_pk_registry:
            book_pk_registry[book_pk] = True
        else:
            excluded_addition_pks.append(addition_pk)

    return additions.exclude(pk__in=excluded_addition_pks)


additions = Addition.objects.filter(collection__subscribers=user)
additions = filter_additions(additions)

If there are likely to be more than a thousand or so books involved, you may want to put a limit on the initial additions query. Passing massive lists of ids over in the exclude isn't such a great idea. Using 'values()' is quite important, because Python can cycle through a basic list of dicts a LOT faster than a queryset and it uses a lot less memory.
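For instance, here's a rough sketch of capping the initial query by a created-date cutoff before handing it to filter_additions (the 30-day window is just an assumption, pick whatever makes sense for your data):

from datetime import datetime, timedelta

# Illustrative only: cap the working set to recent additions so the pk list
# handed to exclude() stays small. The 30-day window is an assumption.
cutoff = datetime.now() - timedelta(days=30)

additions = Addition.objects.filter(
    collection__subscribers=user,
    created__gte=cutoff)
additions = filter_additions(additions)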

Evan Brumley
0

Assuming there won't be huge amounts of additions to display, this could easily do the trick:

# still contains duplicate books..
additions = Addition.objects.filter(collection__subscribers=user, created__gt=DATE_LAST_LOGIN).select_related()

# remove duplication
added_books = {}
for addition in additions:
    added_books[addition.book] = True
added_books = added_books.keys()

By the description you gave of the problem, performance would not be a problem.

Tiago Brandes
    However the "performance would not be a problem" for now, but if you use solutions like this, the performance WILL BE a very big problem soon, even if you have a small amount of data. So avoid this. – Lepi Dec 02 '11 at 12:42
0
from django.db.models import Min

# Annotation aliases renamed (added_by, in_collection) so they don't clash
# with the Addition model's own field names
additions = (Addition.objects
             .filter(collection__subscribers=user)
             .values('book')
             .annotate(added_by=Min('user'), in_collection=Min('collection'))
             .order_by())

This query will give you a list of unique books with their users and collections. Books, collections and users will be pk's, not objects, but I hope you will store them in a cache so that won't be a problem.
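If you do need full objects for display, one possible follow-up (just a sketch; the dict keys match the renamed aliases above) is to resolve those pk's with a couple of in_bulk() calls rather than one query per row:

from django.contrib.auth.models import User

# Each row is a dict like {'book': 1, 'added_by': 2, 'in_collection': 3}
rows = list(additions)

books = Book.objects.in_bulk([row['book'] for row in rows])
users = User.objects.in_bulk([row['added_by'] for row in rows])
collections = Collection.objects.in_bulk([row['in_collection'] for row in rows])

for row in rows:
    book = books[row['book']]
    added_by = users[row['added_by']]
    collection = collections[row['in_collection']]
    # render book, added_by.username and collection.name here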

But for your workload I'd think about denormalization. My query is very heavy, and it isn't easy to cache its results if you have frequent additions. My first approach would be to add a latest_additions field to the Collection model and update it with signals (not adding duplicates). The format of this field is up to you.
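A minimal sketch of that idea, assuming a hypothetical denormalized column on Collection such as latest_additions = models.TextField(blank=True, default='') holding a comma-separated list of recent Addition pks (the field name, the format and the 50-item cap are all assumptions):

from django.db.models.signals import post_save

LATEST_ADDITIONS_CAP = 50  # illustrative

def push_latest_addition(sender, instance, created, **kwargs):
    if not created:
        return
    collection = instance.collection
    pks = [int(pk) for pk in collection.latest_additions.split(',') if pk]
    # Don't add another entry for a book that's already in the list
    if Addition.objects.filter(pk__in=pks, book=instance.book).exists():
        return
    pks.append(instance.pk)
    collection.latest_additions = ','.join(str(pk) for pk in pks[-LATEST_ADDITIONS_CAP:])
    collection.save()

post_save.connect(push_latest_addition, sender=Addition)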

DrTyrsa
0

Sometimes it's OK to drop into SQL, especially when the ORM-only solution is not performant. It's easy to get the non-duplicate Addition row IDs in SQL, and then you can switch back to the ORM to select the data. It's two queries, but will outperform any of the single query solutions I've seen so far.

from django.db import connection
from operator import itemgetter
cursor = connection.cursor()

# Select one addition per book - MAX(id) picks the most recently created
# addition for each book and keeps the query valid under ONLY_FULL_GROUP_BY
query = '''SELECT MAX(id) FROM %s
    GROUP BY book_id''' % Addition._meta.db_table
cursor.execute(query)

# Flatten the results to an id list
addition_ids = map(itemgetter(0), cursor.fetchall())

additions = Addition.objects.filter(
    collection__subscribers=user, id__in=addition_ids).select_related()
Jeremy Lewis
    Won't this fail when addition_ids gets large? I believe most databases will puke if given an IN operator with more than ~60k values, and you'll get performance degradation a much earlier than that. – Evan Brumley Dec 04 '11 at 22:29
  • Not when the IN operation is on an indexed column, which is true in this case. See http://stackoverflow.com/questions/5367488/performance-of-where-id-inids – Jeremy Lewis Dec 05 '11 at 19:31
  • Even if the database can handle it, he's talking about hundreds of thousands of entries. That means queries using this method could end up at several *megabytes* each. That could be ok for a one-off transaction but not for every page-view on his site! – Evan Brumley Dec 05 '11 at 22:53
  • If that's a concern, he can simply add an ORDER BY MAX(updated) LIMIT XXX to the raw SQL query. There's no reason to load 100,000 IDs if you'll only fetch/display 10-100 items. He can also index the updated column to speed up said query (see the sketch below). – Jeremy Lewis Dec 08 '11 at 02:35
  • Of course - the only downside to that is that he'd lose the ability to paginate nicely using the queryset object. Still, it's a workable enough solution. – Evan Brumley Dec 08 '11 at 02:49
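Building on the comment exchange above, here is a rough sketch of that capped variant (the 100-row limit is an assumption, and an index on the updated column is assumed to exist):

from django.db import connection
from operator import itemgetter

cursor = connection.cursor()

# One addition per book, most recently updated books first, capped at 100
# rows so the id list passed to the ORM stays small
query = '''SELECT MAX(id), MAX(updated) AS last_updated FROM %s
    GROUP BY book_id
    ORDER BY last_updated DESC
    LIMIT 100''' % Addition._meta.db_table
cursor.execute(query)

addition_ids = map(itemgetter(0), cursor.fetchall())

additions = Addition.objects.filter(
    collection__subscribers=user, id__in=addition_ids).select_related()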