25

While constructing a complexe QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.

Here are the models:

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')

With my test database, I obtain the following (correct) results:

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

[{'id': 1, 'name': 'James', 'weapon_count': 23},
 {'id': 2, 'name': 'Max', 'weapon_count': 41},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26}]


Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'rarity_sum': 42},
 {'id': 2, 'name': 'Max', 'rarity_sum': 89},
 {'id': 3, 'name': 'Bob', 'rarity_sum': 67}]

If I now combine both annotations in the same QuerySet, I obtain a different (inaccurate) results:

Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
 {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]

Notice how rarity_sum have now different values than before. Removing distinct=True does not affect the result. I also tried to use the DistinctSum function from this answer, in which case all rarity_sum are set to 18 (also inaccurate).

Why is this? How can I combine both annotations in the same QuerySet?

Edit: here is the sqlite query generated by the combined QuerySet:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

The data used for the results above is available here.

abey
  • 593
  • 10
  • 26
  • 1
    I tried to reproduce the behavior. Unfortunately, I got the correct result. [**Screenshot of Django Shell**](https://i.stack.imgur.com/WHMUg.png). Am I missing something? – JPG Jun 15 '19 at 13:57
  • @JPG: now this is weird. Which setup are you using? I'm using Django 2.2.2, Python 3.7.3 and an sqlite backend. 100% vanilla project with Django_extensions app included and the models above. – abey Jun 15 '19 at 14:19
  • **`Python 3.6.7`** and **`Django 2.2.1`** along with ***sqlite*** DB. – JPG Jun 15 '19 at 14:21
  • Do you want to try with the [exact same data](https://pastebin.com/qf5e6vnS) as me? – abey Jun 15 '19 at 14:25
  • Also, do you get the same SQL query (see my edited question)? Can you run my SQL query against your db to check if you get consistent results? – abey Jun 15 '19 at 14:35
  • Yeah. Now I got the same result as you. I – JPG Jun 15 '19 at 14:39
  • I run into this behaviour also although I did not have the time to put up a simple example to report the bug. My guess is that Django ORM is doing some weird stuff and constructing bad queries. In order to avoid this I settled to use `Subquery`. In that way I ensure that the query performed is exactly the one that I want – ivissani Jun 15 '19 at 18:48
  • @ivissani: how would you go about it int his case? – abey Jun 15 '19 at 19:43

4 Answers4

54

This isn't the problem with Django ORM, this is just the way relational databases work. When you're constructing simple querysets like

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

or

Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

ORM does exactly what you expect it to do - join Player with Weapon

SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" 
    ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" 
    ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

or Player with Unit

SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

and perform either COUNT or SUM aggregation on them.

Note that although the first query has two joins between three tables, the intermediate table Unit is neither in columns referenced in SELECT, nor in the GROUP BY clause. The only role that Unit plays here is to join Player with Weapon.

Now if you look at your third queryset, things get more complicated. Again, as in the first query the joins are between three tables, but now Unit is referenced in SELECT as there is SUM aggregation for Unit.rarity:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so a single Unit will be listed once for each player that it references.

But in the third query you're joining Player to Unit and then Unit to Weapon, so not only a single Unit will be listed once for each player that it references, but also for each weapon that references Unit.

Let's take a look at the simple example:

insert into sandbox_player values (1, "player_1");

insert into sandbox_unit values(1, 10, 1);

insert into sandbox_weapon values (1, 1), (2, 1);

One player, one unit and two weapons that reference the same unit.

Confirm that the problem exists:

>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum

>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>

>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>


>>> Player.objects.annotate(
...     weapon_count=Count('unit_set__weapon_set', distinct=True),
...     rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>

From this example it's easy to see that the problem is that in the combined query the unit will be listed twice, one time for each of the weapons referencing it:

sqlite> SELECT "sandbox_player"."id",
   ...>        "sandbox_player"."name",
   ...>        "sandbox_weapon"."id",
   ...>        "sandbox_unit"."rarity"
   ...> FROM "sandbox_player"
   ...>          LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
   ...>          LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id          name        id          rarity    
----------  ----------  ----------  ----------
1           player_1    1           10        
1           player_1    2           10   

What should you do?

As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:

>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
...     weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
...     rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>

which produces the following SQL

SELECT "sandbox_player"."id", "sandbox_player"."name", 
(
    SELECT COUNT(U2."id") AS "weapon_count"
    FROM "sandbox_player" U0 
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    LEFT OUTER JOIN "sandbox_weapon" U2 
        ON (U1."id" = U2."unit_id")
    WHERE U0."id" = ("sandbox_player"."id") 
    GROUP BY U0."id", U0."name"
) AS "weapon_count", 
(
    SELECT SUM(U1."rarity") AS "rarity_sum"
    FROM "sandbox_player" U0
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"
jarekwg
  • 375
  • 3
  • 8
rktavi
  • 752
  • 7
  • 11
  • Thanks for your answer. Your two subqueries miss a `.filter(pk=OuterRef('pk'))` step to yield accurate results with my data, though. Also, why is `distinct=True` required on the subquery? – abey Jun 16 '19 at 14:03
  • 4
    Also, for records. Although I understand what you mean, I disagree that this is not a problem of Django's ORM. I feel that my original combine query captures pretty well my intent, and should work out of the box. As a matter of fact, I found out that is officially considered a [bug](https://code.djangoproject.com/ticket/10060) and referenced in the [official documentation](https://docs.djangoproject.com/en/2.2/topics/db/aggregation/#combining-multiple-aggregations). – abey Jun 16 '19 at 14:07
  • 1
    @abey indeed there should be `.filter` call, otherwise `weapon_count` and `rarity_sum` will be the same for all of the players. That's rather unfortunate, because subquery returns multiple counts (one for each player) and sqlite happily puts the first count into the query while throwing away everything else, so it's quite easy to overlook. PostgreSQL, for example, will throw an error in this case: `more than one row returned by a subquery used as an expression`. I've updated both Python code and SQL. – rktavi Jun 16 '19 at 14:41
  • 1
    @abey distinct seems to have crept there from the samples that I've copied from your question. In the case of subquery `COUNT(DISTINCT ..)` doesn't hurt, but it's rather unnecessary, so I've removed it too. – rktavi Jun 16 '19 at 14:50
  • @abey Whether to consider it a bug or not depends entirely on what your stance regarding ORM is. If you treat `.annotate` call as "put `COUNT` in my query" (which seems quite logical), then everything works as expected. But if by putting `.annotate` you mean "count the objects", then this is indeed a bug. However, user intentions and underlying logic do not have 1-to-1 mapping, so, in my opinion, the ideal solution would be to produce an error or a warning in this case. Transforming the query into subquery or something much complex automatically seems too magical and unexpected to me. – rktavi Jun 16 '19 at 15:02
  • Seeing that splitting the query in two successive `.annotate` statements has the same erroneous behaviour, and that two completely unrelated pieces of code could be adding one such `.annotate` statement each to the same `QuerySet`, I will stand by my position that this is indeed a bug. And I don't want to disagree with maintainers ;) – abey Jun 16 '19 at 17:34
  • Nice findings. +1 from me :) – JPG Jun 17 '19 at 10:46
  • Wonderful explanation! I feel like this should be added to the Django docs :D – Daniel Holmes Sep 17 '19 at 13:26
  • Regardless of if annotate() should "inherently" mean this or that, the fact remains that this bug has for 11 years haunted Django-developers who have wasted dozens of hours each to find it (myself included, just now). It's objectively simply not the expected behaviour. – BjornW May 11 '20 at 13:38
  • I'm not able to get this to work in 3.0.7. Getting `AttributeError: 'IntegerField' object has no attribute 'select_format'` – Rob Jun 13 '20 at 17:22
9

A few notes to complement rktavi's excellent answer:

1) This issues has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.

2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bone annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing (35 sec vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g a Table.order_FOO() function), leading to the issue.

3) With the same setup, I have anecdotical evidence that subquery-type QuerySets are faster compared to bare-bone annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs 260 ms) and hosted PostgreSQL (320 ms vs 540 ms).

As a result of the above, I will completely avoid using bare-bone annotations in favour of subqueries.

abey
  • 593
  • 10
  • 26
7

Based on the excellent answer from @rktavi, I created two helpers classes that simplify the Subquery/Count and Subquery/Sum patterns:

class SubqueryCount(Subquery):
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)

One can use these helpers like so:

from django.db.models import OuterRef

weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))

qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))
Benoit Blanchon
  • 13,364
  • 4
  • 73
  • 81
  • It worked extremely well for me in SQlite and far quicker than the conventional Django Count annotation ... however when I tried in a MariaDB environnement ... it crashed ! MariaDB is apparently not able / not willing to handle correlated subqueries as those are considered sub-optimal. Would anyone be able to help me figure out a way to implement those helper functions for MariaDB ? What should template be in this environnement? – Skratt Mar 08 '21 at 15:47
  • benoit-blanchon, @rktavi : any suggestions how I should modify the Subquery template for having these correlated subqueries work in a MariaDB environnement? – Skratt Mar 09 '21 at 09:54
  • @Skratt, I only tried PostgreSQL, but I'm surprised it doesn't work on MariaDB, what's the error? – Benoit Blanchon Mar 09 '21 at 11:03
  • OperationalError (1054, "Unknown column '.id' in 'where clause'") in python3.7/dist-packages/MySQLdb/connections.py in query (226. _mysql.connection.query(self, query) - Looking into it on MariaDB documentation, I found out that 'correlated subqueries' are not supported by MariaDB (https://mariadb.com/kb/en/subquery-limitations/) – Skratt Mar 10 '21 at 08:25
  • It looks like your table doesn't have an `id` column; did you [change the primary key](https://docs.djangoproject.com/en/3.1/ref/models/fields/#django.db.models.Field.primary_key)? If so, you simply need to replace the column name passed to `OuterRef()`. – Benoit Blanchon Mar 10 '21 at 09:07
  • no, the PK has not been changed. The same code exactly works super efficiently on sqlite – Skratt Mar 11 '21 at 07:33
  • Ok, I was finally able to find the solution and published the solution (Database independant solution) under: https://stackoverflow.com/questions/61677702/how-to-get-a-count-based-on-a-subquery/66614695#66614695. Thanks for your support – Skratt Mar 13 '21 at 14:37
2

Thanks @rktavi for your amazing answer!!

Here's my use case:

Using Django DRF.

I needed to get Sum and Count from different FK's inside the annotate so that it would all be part of one queryset in order to add these fields to the ordering_fields in DRF.

The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.

The annotate was occasionally returning the dates as strings, so I needed to Cast it to DateTimeField.

    donation_filter =  Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
    total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter )).filter(pk=OuterRef('pk'))
    message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
    queryset = User.objects.annotate(
        total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
        last_donation_date=Cast(Max('payments__updated', filter=donation_filter ), output_field=DateTimeField()),
        message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
        last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
    )
Avramo
  • 143
  • 9