43

UPDATE

Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.

The problem

I am trying to translate an SQL query into Django, but am getting an error that I don't understand.

Here is the Django model I have:

class Title(models.Model):
  title_id = models.CharField(primary_key=True, max_length=12)
  title = models.CharField(max_length=80)
  publisher = models.CharField(max_length=100)
  price = models.DecimalField(decimal_places=2, blank=True, null=True)

I have the following data:

publisher                    title_id      price  title
---------------------------  ----------  -------  -----------------------------------
New Age Books                PS2106         7     Life Without Fear
New Age Books                PS2091        10.95  Is Anger the Enemy?
New Age Books                BU2075         2.99  You Can Combat    Computer Stress!
New Age Books                TC7777        14.99  Sushi, Anyone?
Binnet & Hardley             MC3021         2.99  The Gourmet Microwave
Binnet & Hardley             MC2222        19.99  Silicon Valley   Gastronomic Treats
Algodata Infosystems         PC1035        22.95  But Is It User Friendly?
Algodata Infosystems         BU1032        19.99  The Busy Executive's   Database Guide
Algodata Infosystems         PC8888        20     Secrets of Silicon Valley

Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.

The SQL query that does this is as follows:

SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
  SELECT price * 2 AS dbl_price, publisher
  FROM title
) AS A 
GROUP BY publisher

The desired output would be:

publisher                    tot_dbl_prices
---------------------------  --------------
Algodata Infosystems                 125.88
Binnet & Hardley                      45.96
New Age Books                         71.86 

Django query

The query would look like:

Title.objects
 .annotate(dbl_price=2*F('price'))
 .values('publisher')
 .annotate(tot_dbl_prices=Sum('dbl_price'))

but gives an error:

KeyError: 'dbl_price'. 

which indicates that it can't find the field dbl_price in the queryset.

The reason for the error

Here is why this error happens: the documentation says

You should also note that average_rating has been explicitly included in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.

If the values() clause precedes the annotate() clause, any annotations will be automatically added to the result set. However, if the values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.

So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().

However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since

If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.

which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.

So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:

Title.objects
 .annotate(dbl_price=2*F('price'))
 .values('publisher')
 .annotate(sum_of_prices=Count('price'))

because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.

The question

So, here we have it: I need to include annotated property into values to keep it in the queryset, but I can't do that because values is also used for grouping (which will be wrong with an extra field). The problem essentially is due to the two very different ways that values is used in Django, depending on the context (whether or not values is followed by annotate) - which is (1) value extraction (SQL plain SELECT list) and (2) grouping + aggregation over the groups (SQL GROUP BY) - and in this case these two ways seem to conflict.

My question is: is there any way to solve this problem (without things like falling back to raw sql)?

Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons: 1. There are also more complex examples, where the suggested workaround would not work. 2. I can imagine situations, where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types. 3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.

Community
  • 1
  • 1
Leonid Shifrin
  • 22,449
  • 4
  • 68
  • 100
  • There's one thing you can do if you haven't already, is print the constructed SQL string of the queryset when it is evaluated, this way you can try switching the order of things untill you get the original SQL query you're trying to emulate – Mojimi Mar 30 '17 at 18:41
  • 1
    @Mojimi Thanks for the suggestion. But I am not really interested in get that particular example above to work, by trying. I am interested to understand how to make this work in general, preferably using only documented user-level Django capabilities, or that this just can't be done, for a general class of queries where you get some annotated property and then want to aggregate over it in GROUP BY. – Leonid Shifrin Mar 30 '17 at 20:42
  • @LeonidShifrin As per the chat I can conclude you found raw query was the way to go. If that is the case post an answer here stating no other alternatives were found and mark it as the accepted one. – SomeTypeFoo Mar 31 '17 at 09:21
  • @ThulasiRam I am not sure yet. Raw queries are the last resort, to my mind, and I wouldn't be happy posting that as an answer unless I am sure it's the only one. I didn't have the time yet to try other things. Once I get a satisfactory answer, I will surely post it here, unless someone else posts it first. – Leonid Shifrin Mar 31 '17 at 21:14

4 Answers4

50

Update: Since Django 2.1, everything works out of the box. No workarounds needed and the produced query is correct.

This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).

The problem is, call to .values('publisher'), which is required to provide grouping, removes all annotations, that are not included in .values() fields param.

And we can't include dbl_price to fields param, because it will add another GROUP BY statement.

The solution in to make all aggregation, which requires annotated fields firstly, then call .values() and include that aggregations to fields param(this won't add GROUP BY, because they are aggregations). Then we should call .annotate() with ANY expression - this will make django add GROUP BY statement to SQL query using the only non-aggregation field in query - publisher.

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(sum_of_prices=Sum('dbl_price'))
    .values('publisher', 'sum_of_prices')
    .annotate(titles_count=Count('id'))

The only minus with this approach - if you don't need any other aggregations except that one with annotated field - you would have to include some anyway. Without last call to .annotate() (and it should include at least one expression!), Django will not add GROUP BY to SQL query. One approach to deal with this is just to create a copy of your field:

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price')) # note the underscore!
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices')

Also, mention, that you should be careful with QuerySet ordering. You'd better call .order_by() either without parameters to clear ordering or with you GROUP BY field. If the resulting query will contain ordering by any other field, the grouping will be wrong. https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by

Also, you might want to remove that fake annotation from your output, so call .values() again. So, final code looks like:

Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices'))
    .values('publisher', 'sum_of_prices')
    .order_by('publisher')
Alexandr Tatarinov
  • 3,946
  • 1
  • 15
  • 30
  • 10
    This is phenomenal! You have cracked a problem that has plagued a lot of django developers for a long time. – David Mar 04 '18 at 10:13
  • 5
    This is probably one of the best answers to understand some ORM techniques I've seen on SO. – nael Aug 24 '18 at 04:08
  • 1
    Sorry for coming around so late. I have seen your answer for a long time, but I have moved from Django to SQLAlchemy very soon after asking this question. This does not excuse such a long delay though. Very elegant solution which reveals some internal logic and extends the limits of what seemed possible. Many thanks. – Leonid Shifrin Jun 05 '19 at 00:33
  • In my case hint with using `F()` in last annotation didn't work, I had to use true aggregation to trigger proper group by eg. `Count()`. – amonowy Nov 08 '19 at 11:38
5

This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.

You can get your query to work like this:

Title.objects
  .values('publisher')
  .annotate(total_dbl_price=Sum(2*F('price'))

which produces following SQL:

SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher

which just happens to work in your case.

I understand this might not be the complete solution you were looking for, but some even complex annotations can also be accommodated in this solution by using CombinedExpressions(I hope!).

lumos42
  • 106
  • 8
  • 1
    Thanks! This is more or less what [Thulasi Ram](http://stackoverflow.com/users/6323666/thulasi-ram) has suggested in his answer. This solves the particular example I have in the question, but this does not solve the general case I am interested in (see the comments below the mentioned answer, particularly exchange between @SergGr and myself). Still, your answer increases my confidence in that I will need to dig into the Django source code and try other things, since it looks like there isn't a simple way to fix this situation in general, using only top-level user-exposed Django functionality. – Leonid Shifrin Mar 29 '17 at 19:55
  • It also looks like this problem is really hard to solve mostly when one has something like nested GROUP BY or other cases where initially annotated field is actually an aggregation over several fields (possibly for a related model). In cases where annotated fields only are computed over a single row of the original table, the annotation seems to indeed commute with GROUP BY, so that your suggested solution would indeed work. – Leonid Shifrin Mar 29 '17 at 19:59
3

Your problem comes from values() follow by annotate(). Order are important. This is explain in documentation about [order of annotate and values clauses]( https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses)

.values('pub_id') limit the queryset field with pub_id. So you can't annotate on income

The values() method takes optional positional arguments, *fields, which specify field names to which the SELECT should be limited.

Wilfried
  • 1,623
  • 1
  • 12
  • 19
  • Thanks for an attempt, but this isn't true. If I annotate on other fields like `title_id`, everything works (if I change `annotate(total_income=Sum('income'))` to `annotate(total_titles=Count('title_id'))` - and `title_id` isn't in the list of fields in `values` either). And yes, I know that the order is important, and specifically that `annotate` coming after `values` is used to implement `GROUP BY` + aggregation in Django - which is what I need in this case. – Leonid Shifrin Mar 24 '17 at 21:34
  • allright. My bad :) Question. Title_id is a field of your model, and income is an aggregation. This can be a reason why title_id can work in this case ? – Wilfried Mar 24 '17 at 21:35
  • 1
    Actually, you may be partially right. The docs say *If the values() clause precedes the annotate() clause, any annotations will be automatically added to the result set. However, if the values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column*, and so it looks like I need to list the aggregated column explicitly in `values()`. But then, there seems to be no way to only group by the non-aggregate field. – Leonid Shifrin Mar 24 '17 at 21:39
  • Yes, this certainly can be the reason. And I think that your reference to the docs kind of explains it. I seem to need to explicitly include the aggregate field into `values()`. But the problem is, that `values` is also used for GROUP BY - and then should only contain fields we are grouping by. So, I need to include aggregated field into `values()` to preserve it, but I need to exclude it to not group by it. This is the problem. And it comes from the fact that `values` is used in 2 very different ways depending on whether or not followed by `annotate`. In this case, these ways seem to conflict – Leonid Shifrin Mar 24 '17 at 21:42
  • Thank a lot for taking the time to answer. I think this in any case pointed me to the right direction. Much appreciated. – Leonid Shifrin Mar 25 '17 at 00:07
  • You are welcomme. Can you add you desire output for your request ? – Wilfried Mar 25 '17 at 09:23
  • Sure. I have added sample data and the desired output, to the question. – Leonid Shifrin Mar 25 '17 at 18:25
  • I did not test yet, but do you tried: `Title.objects.annotate(tot_dbl_price=Sum(2*F('price'))).values('publisher')` – Wilfried Mar 26 '17 at 09:35
  • This won't work as intended, because the `annotate` here will compute over single instances of `Title`, so `Sum` is a trivial sum of a single element. So the result of this would be exactly the same as for `Title.objects.annotate(tot_dbl_price=2*F('price')).valu‌​es('publisher')` (without `Sum`), and `values` here would just extract the `publisher` field from the resulting queryset. What I am after is that you first add an annotation, *then* use `values` and the second `annotate` to group titles by the publisher and compute the sum over those groups. And the problem is, as I explained in ... – Leonid Shifrin Mar 26 '17 at 10:46
  • ... the question, that, on one hand, you must add all annotated fields to `values`, to keep them in the queryset, but on the other hand, if I use `values('publisher', 'dbl_price').annotate(...)`, then it will group by unique combinations of both `publisher` and `dbl_price` fields, which is not what I need - I only need to group by `publisher`. In other words, in Django one can't express something like the following: `SELECT SUM(dbl_price) AS total_dbl_price, publisher_id FROM ( SELECT price * 2 AS dbl_price, publisher_id FROM title ) AS A GROUP BY publisher_id ` – Leonid Shifrin Mar 26 '17 at 10:52
  • I rewrote the question a bit, for clarity, and added an equivalent SQL query. – Leonid Shifrin Mar 26 '17 at 11:08
1

This solution by @alexandr addresses it properly.

https://stackoverflow.com/a/44915227/6323666

What you require is this:

from django.db.models import Sum

Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))

Ideally I reversed the scenario here by summing them up first and then doubling it up. You were trying to double it up then sum up. Hope this is fine.

SomeTypeFoo
  • 886
  • 7
  • 16
  • Well, the real problem I was interested in was not how to perform this particular operation. The real problem is how to group by, computing aggregation on an annotated field rather than model field. The problem in the question is just an example to illustrate the difficulty. Thanks for the reply, you get my +1 for sure, but this doesn't really answer the main difficulty: when you group by, you use `values` + `annotate`. In `values`, you only list fields you group over. But, all previously annotated properties are lost unless also listed in `values`. – Leonid Shifrin Mar 28 '17 at 17:00
  • 1
    So, when you have `annotate(prop=...).values('otherprop').annotate(my_aggregate= Sum('prop'))`, you just can't do that, because by the time you do the second `annotate`, the `prop` has been lost already. To not lose it, you have to include it into `values`. But then, you are not grouping on `otherprop`, but on unique combinations of `otherprop` and `prop`. The problem is there because `values` is used in 2 different ways in Django, and in this particular setup, they do conflict. I was wondering whether there exists idiomatic solution to this general problem, without falling back to raw sql. – Leonid Shifrin Mar 28 '17 at 17:04
  • @LeonidShifrin, out of curiosity, could you provide an example of when the approach suggested by Thulasi i.e. moving all the annotations after groupping (i.e. `values` call) doesn't work? – SergGr Mar 28 '17 at 18:32
  • @SergGr I think I can. But it will be a somewhat more complex. Imagine that we have a de-facto M2M relation between `Title` model and `Author` model, going through intermediate `TitleAuthor` model that has fields like `author` and `title` which are foreign keys to these, plus may be some additional fields. Now, I may want to, for example, annotate each book with its number of authors, then group by the publisher, and then compute the average number of authors for each publisher. – Leonid Shifrin Mar 28 '17 at 20:11
  • @SergGr [2] Had the GROUP BY aggregation in Django worked with annotated fields, and I would do that as follows (note that `titleauthor__author` is a reverse lookup which does implicit join): `Title.objects.annotate(num_authors=Count(F('titleauthor__author'))).values('publisher').annotate(avgauth_num = Avg(F('num_authors')))`. But I don't see how I can do that if I try to move all annotation after grouping. – Leonid Shifrin Mar 28 '17 at 20:13
  • @SergGr [3] This: `Title.objects.values('publisher').annotate(av_auth=ExpressionWrapper(Count(F('titleauthor__author'))/Count(F('title')), output_field=FloatField()))` does not give the right result, because there can be titles which have no authors (yet indicated), but at the aggregation time, I can't set up proper filtering. And in any case, had I wanted to compute an average of say, squares of the numbers of authors for each book for a given publisher, and this could not possibly work, because one would not be able to obtain those individual squares at aggregation time. – Leonid Shifrin Mar 28 '17 at 20:23
  • @LeonidShifrin, Thank you for the example. AFAIU the real difficulty with it is that unlike your question's example, this one implicitly requires two `GROUP BY` instead of just one because your inner query also does some aggregation rather than simple math function and thus you can't move it (the inner aggregation) to the outer query. – SergGr Mar 28 '17 at 20:55
  • @SergGr Yes, you are right. I don't know whether one can construct a more trivial example where annotations and GROUP BY don't commute, it may or may not be possible, didn't think enough about it. But regardless of this, my main question / point is that Django seems to not allow to aggregate over annotated properties, which is rather unfortunate and limiting. I was / am hoping for some workarounds and also explanations of whether this is a design oversight, or there is some rationale for this behavior. Of course, I can just dig into Django sources, but I wanted to check with others first. – Leonid Shifrin Mar 28 '17 at 21:10
  • @SergGr Actually, there is a much simpler example: annotate each title with a square of its price, and then compute the average squared price for books for a given publisher: `Title.objects.annotate(sq_price=F('price') * F('price')).values('publisher').annotate(avg_sq_price=Avg('sq_price'))`. I would not know how to do even this if I try to do all aggregation after `values()`. – Leonid Shifrin Mar 28 '17 at 21:17
  • 1
    @LeonidShifrin, I'm not a Django expert but after fast glance over the code of [`model.Query`](https://github.com/django/django/blob/master/django/db/models/query.py) and [`model.sql.Query`](https://github.com/django/django/blob/master/django/db/models/sql/query.py) I came to a conclussion that they are not designed to generate subqueries at all with the only exception of `model.Query.aggregate` call which uses `model.sql.Query.get_aggregation` that you might find interesting for inspiration if you want a workaround on a bit higher level than plain SQL. – SergGr Mar 28 '17 at 21:18
  • @SergGr Thanks for the pointer. Actually, Django 11 introduces `SubQuery`, which allows in particular, correlated subqueries. One can also do subqueries via `filter(field__in =another-queryset) `. But yes, I will look into those. My interest was / is mainly to see whether GROUP BY on annotated properties can be made to work in all generality, rather than any particular query, and whether I am missing something obvious to fix this, but I am probably not. So indeed, I may need to just look into the code. – Leonid Shifrin Mar 28 '17 at 21:22
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/139308/discussion-between-serggr-and-leonid-shifrin). – SergGr Mar 28 '17 at 21:22