4

I'm trying to use the Django ORM for a task that requires a JOIN in SQL. I already have a workaround that accomplishes the same task with multiple queries and some off-DB processing, but I'm not satisfied by the runtime complexity.

First, I'd like to give you a short introduction to the relevant part of my model. After that, I'll explain the task in English, SQL and (inefficient) Django ORM.

The Model

In my CMS model, posts are multi-language: For each post and each language, there can be one instance of the post's content. Also, when editing posts, I don't UPDATE, but INSERT new versions of them.

So, PostContent is unique on post, language and version. Here's the class:

class PostContent(models.Model):
    """ contains all versions of a post, in all languages. """
    language = models.ForeignKey(Language)
    post = models.ForeignKey(Post)           # the Post object itself only
    version = models.IntegerField(default=0) # contains slug and id.
    # further metadata and content left out

    class Meta:
        unique_together = (("resource", "language", "version"),)

The Task in SQL

And this is the task: I'd like to get a list of the most recent versions of all posts in each language, using the ORM. In SQL, this translates to a JOIN on a subquery that does GROUP BY and MAX to get the maximum of version for each unique pair of resource and language. The perfect answer to this question would be a number of ORM calls that produce the following SQL statement:

SELECT
    id, 
    post_id, 
    version,
    v
FROM
    cms_postcontent,  
    (SELECT 
        post_id as p, 
        max(version) as v, 
        language_id as l 
    FROM 
        cms_postcontent 
    GROUP BY 
        post_id, 
        language_id
    ) as maxv
WHERE 
    post_id=p 
    AND version=v 
    AND language_id=l;

Solution in Django

My current solution using the Django ORM does not produce such a JOIN, but two seperate SQL queries, and one of those queries can become very large. I first execute the subquery (the inner SELECT from above):

maxv = PostContent.objects.values('post','language').annotate(
  max_version=Max('version'))

Now, instead of joining maxv, I explicitly ask for every single post in maxv, by filtering PostContent.objects.all() for each tuple of post, language, max_version. The resulting SQL looks like

SELECT * FROM PostContent WHERE 
       post=P1 and language=L1 and version=V1 
    OR post=P2 and language=L2 and version=V2
    OR ...;

In Django:

from django.db.models import Q
conjunc = map(lambda pc: Q(version=pc['max_version']).__and__(
  Q(post=pc['post']).__and__(
  Q(language=pc['language']))), maxv)
result = PostContent.objects.filter(
  reduce(lambda disjunc, x: disjunc.__or__(x), conjunc[1:], conjunc[0]))

If maxv is sufficiently small, e.g. when retrieving a single post, this might be a good solution, but the size of the query and the time to create it grow linearly with the number of posts. The complexity of parsing the query is also at least linear.

Is there a better way to do this, apart from using raw SQL?

andreas
  • 41
  • 2
  • 4
  • This is long, and not especially clear. Please distill this down. – Marcin Mar 11 '12 at 16:58
  • I couldn't make it shorter, but I hope it's clearer now. – andreas Mar 11 '12 at 18:28
  • possible duplicate of [Django Query That Get Most Recent Objects From Different Categories](http://stackoverflow.com/questions/2074514/django-query-that-get-most-recent-objects-from-different-categories) – dbn Jul 03 '14 at 19:10

1 Answers1

0

You can join (in the sense of union) querysets with the | operator, as long as the querysets query the same model.

However, it sounds like you want something like PostContent.objects.order_by('version').distinct('language'); as you can't quite do that in 1.3.1, consider using values in combination with distinct() to get the effect you need.

Marcin
  • 48,559
  • 18
  • 128
  • 201
  • QuerySet.__or__ combines two queries with the SQL `OR` operator, which results in the union of the sets, not a join. – andreas Mar 11 '12 at 16:49
  • I want something similar, but `QuerySet.distinct` in Django 1.3.1 only accepts a boolean as parameter :-) – andreas Mar 11 '12 at 17:17
  • 1
    I see which solution you're after, and it's specific to Postgres: Django 1.4 indeed implements QuerySet.distinct(self, *field_names), which is compiled to the `SELECT DISTINCT ON(fields)` option that Postgres understands. If not using Postgres as the backend, a `NotImplementedError` is returned. I want to use the ORM as an abstraction from the backend, therefore `DISTINCT ON` is no good solution for me. The regular `SELECT DISTINCT` option works different: it eliminates duplicate rows from the results. Since each `PostContent` has a unique private key, `DISTINCT` doesn't change the result. – andreas Mar 11 '12 at 19:23