I'm trying to convert as much raw SQL to use the Django ORM as I can, and I've run into a snag. I'm trying to perform a query similar to this one:
SELECT table.x,
MAX(table.y) AS y,
table.group_category,
table.group_number,
FROM table
GROUP BY table.group_category, table.group_number
So far, what I've been trying has been some permutation of this:
q = MyModel.objects\
.filter(**filter_kwargs)\
.values('group_category', 'group_number')\
.annotate(y=Max('y'))\
.values('x','y','group_category','group_number')
However, this doesn't seem to work. If I exclude the last values()
, it produces the following (roughly):
SELECT MAX(table.y) AS y,
table.group_category,
table.group_number,
FROM table
GROUP BY table.group_category, table.group_number
It doesn't select table.x
. But if I include the last values()
...
SELECT table.x,
MAX(table.y) AS y,
table.group_category,
table.group_number,
FROM table
GROUP BY x, y, table.group_category, table.group_number
It groups by x, y
. So what clearly seems to be happening is that the all of the values are replaced and annotate uses whatever values the QuerySet is given (because it's evaluated lazily?). The docs on aggregation and values seem to suggest that doing the two values functions in this order would have the desired effect, and I found a writeup (from 2013) that also suggests this. Am I doing something wrong? Is this still possible in the Django ORM? Is there any way for me to do this without using extra() or raw SQL? I tried to keep this example as simple as possible for demonstration purposes, but my actual problem involves a JOIN. Could that be complicating it?
UPDATE 1
I was able to kinda figure it out, however, it still doesn't produce the best version (above) of the SQL query that I want. In order to get the results I needed, I instead perform a query to get MAX(table.y)
, and then use that as a subquery using __in
against the subquery's value. The subquery does the grouping.
filtered = MyModel.objects.filter(**filter_kwargs)
subq = filtered\
.values('group_category', 'group_number')\
.annotate(y=Max('y'))\
.values_list('y', flat=True)
q = filtered\
.filter(y__in=subq)\
.values('x','y','group_category','group_number')
As I said, that works as in it gets me the results I need. The problem is that it's much slower than just having SELECTs different than the GROUP BYs, because it creates a relatively bulky subquery. I'm not marking this as the answer just yet because it still doesn't produce a query that matches what I really want. Instead, it looks something like this:
SELECT table.x,
table.y,
table.group_category,
table.group_number,
FROM table
WHERE y IN
(SELECT MAX(U0.y) AS y
FROM table U0
GROUP BY U0.group_category, U0.group_number)
Furthermore, it doesn't look like I can even use extra() because it will similarly only add columns to the SELECT clause that are already part of the QuerySet, i.e. values()
.
UPDATE 2
It turns out that my messy workaround does not work, because it gets the MAX of all the y (1 row) and returns it, rather than grouping them together by group_category and group_number and using the MAX
of their y
's, so I'm back to the drawing board.