0

I'm trying to convert as much raw SQL to use the Django ORM as I can, and I've run into a snag. I'm trying to perform a query similar to this one:

SELECT table.x,
      MAX(table.y) AS y,
      table.group_category,
      table.group_number,
FROM table
GROUP BY table.group_category, table.group_number

So far, what I've been trying has been some permutation of this:

q = MyModel.objects\
    .filter(**filter_kwargs)\
    .values('group_category', 'group_number')\
    .annotate(y=Max('y'))\
    .values('x','y','group_category','group_number')

However, this doesn't seem to work. If I exclude the last values(), it produces the following (roughly):

SELECT MAX(table.y) AS y,
      table.group_category,
      table.group_number,
FROM table
GROUP BY table.group_category, table.group_number

It doesn't select table.x. But if I include the last values()...

SELECT table.x,
      MAX(table.y) AS y,
      table.group_category,
      table.group_number,
FROM table
GROUP BY x, y, table.group_category, table.group_number

It groups by x, y. So what clearly seems to be happening is that the all of the values are replaced and annotate uses whatever values the QuerySet is given (because it's evaluated lazily?). The docs on aggregation and values seem to suggest that doing the two values functions in this order would have the desired effect, and I found a writeup (from 2013) that also suggests this. Am I doing something wrong? Is this still possible in the Django ORM? Is there any way for me to do this without using extra() or raw SQL? I tried to keep this example as simple as possible for demonstration purposes, but my actual problem involves a JOIN. Could that be complicating it?


UPDATE 1

I was able to kinda figure it out, however, it still doesn't produce the best version (above) of the SQL query that I want. In order to get the results I needed, I instead perform a query to get MAX(table.y), and then use that as a subquery using __in against the subquery's value. The subquery does the grouping.

filtered = MyModel.objects.filter(**filter_kwargs)

subq = filtered\
    .values('group_category', 'group_number')\
    .annotate(y=Max('y'))\
    .values_list('y', flat=True)

q = filtered\
    .filter(y__in=subq)\
    .values('x','y','group_category','group_number')

As I said, that works as in it gets me the results I need. The problem is that it's much slower than just having SELECTs different than the GROUP BYs, because it creates a relatively bulky subquery. I'm not marking this as the answer just yet because it still doesn't produce a query that matches what I really want. Instead, it looks something like this:

SELECT table.x,
      table.y,
      table.group_category,
      table.group_number,
FROM table
WHERE y IN 
    (SELECT MAX(U0.y) AS y
    FROM table U0
    GROUP BY U0.group_category, U0.group_number)

Furthermore, it doesn't look like I can even use extra() because it will similarly only add columns to the SELECT clause that are already part of the QuerySet, i.e. values().


UPDATE 2

It turns out that my messy workaround does not work, because it gets the MAX of all the y (1 row) and returns it, rather than grouping them together by group_category and group_number and using the MAX of their y's, so I'm back to the drawing board.

user7859002
  • 43
  • 1
  • 6
  • Take a look at https://stackoverflow.com/questions/37599943/django-orm-group-by-one-column-and-return-other and https://simpleisbetterthancomplex.com/tutorial/2016/12/06/how-to-create-group-by-queries.html – jackotonye Jan 04 '18 at 01:24
  • Thanks for the links. I had already read that tutorial posted, but unfortunately it doesn't cover my use case, and I had used it to get as far as I did. As for the other question... their example code (regardless of their issue with the type being returned) doesn't seem to produce the SQL query that they need, and reflects my first example shown in the question (i.e. without the other selected values). A related question is (https://stackoverflow.com/questions/17888558/django-orm-grouped-aggregates-with-different-select-clauses) but I still can't seem to figure it out. – user7859002 Jan 04 '18 at 16:22

1 Answers1

0

What you seem to want is to calculate max values but to return all rows without any grouping. This is what Window functions are for (available from Django 2.0):

models = MyModel.objects.annotate(max=Window(
    expression=Max('y'),
    partition_by=[F('group_category'), F('group_number')],))

But why doesn't your approach with GROUP BY work?

In the original query, the reason Django (and the database; the SQL you quoted would throw a syntax error) insists on grouping by x is that if you group by category and number, you may have several x values for one grouped row of category and number. Which one should the DB select? It cannot make that choice for you.

If x is not important, you could leave it out. If it is important but always has the same value for a group of category and number, then grouping the query by x doesn't hurt you. If there are different x values and they are important, you need to make up your mind about which one to select (and tell the DB accordingly). The same goes for y.

Endre Both
  • 5,540
  • 1
  • 26
  • 31