20

In the following setup, I'd like a QuerySet with a list of projects, each annotated with the sum of all its task durations (as tasks_duration) and the sum of all of its tasks' subtask durations (as subtasks_duration). My models (simplified) look like this:

class Project(models.Model):
    pass

class Task(models.Model):
    project = models.ForeignKey(Project)
    duration = models.IntegerField(blank=True, null=True)

class SubTask(models.Model):
    task = models.ForeignKey(Task)
    duration = models.IntegerField(blank=True, null=True)

I make my QuerySet like this:

Projects.objects.annotate(tasks_duration=Sum('task__duration'), subtasks_duration=Sum('task__subtask__duration'))

Related to the behaviour explained in Django annotate() multiple times causes wrong answers I get a tasks_duration that is much higher than it should be. The multiple annotate(Sum()) clauses yield multiple left inner joins in the resultant SQL. With only a single annotate(Sum()) term for tasks_duration, the result is correct. However, I'd like to have both tasks_duration and subtasks_duration.

What would be a suitable way to do this query? I have a working solution that does it per-project, but that's expectedly unusably slow. I also have something similar working with an extra() call, but I'd really like to know if what I want is possible with pure Django.

Community
  • 1
  • 1
Charl Botha
  • 4,373
  • 34
  • 53
  • Did you try `tasks_duration=Sum('task__duration', distinct=True), subtasks_duration=Sum('task__subtask__duration', distinct=True)` as mentioned in the other question you linked ? – jpic Aug 24 '12 at 12:23
  • That will only serve to sum distinct duration values, which is not what I want. Out of curiosity I did try it, still yields the same incorrect values. (the durations do vary) – Charl Botha Aug 24 '12 at 14:30
  • Ever find a workaround for this? I'm trying to annotate a sum and a count on one queryset, and the sum keeps getting multiplied... – StephenTG Mar 18 '14 at 19:03
  • I never got this particular construction working I'm afraid. I just checked the code from which my simple example above was derived: I ended up doing the tally in an extra() clause. – Charl Botha Mar 19 '14 at 13:36
  • You don't need extra. Read my updated answer, please. @CharlBotha – Happy Ahmad Jul 27 '18 at 04:50
  • 1
    The ORM `Subquery` construct shipped in Django 1.11. When I posted my question in 2012, Django 1.11 was still far in the future! – Charl Botha Jul 27 '18 at 08:35
  • @CharlBotha Aren't you going to accept an answer as the best answer? – Happy Ahmad Feb 20 '22 at 07:36

2 Answers2

17

The bug is reported here but it's not solved yet even in Django 1.11. The issue is related to joining two tables in reverse relations. Notice that distinct parameter works well for Count but not for Sum. So you can use a trick and write an ORM like below:

 Projects.objects.annotate(
      temp_tasks_duration=Sum('task__duration'),
      temp_subtasks_duration=Sum('task__subtask__duration'),
      tasks_count=Count('task'),
      tasks_count_distinct=Count('task', distinct=True),
      task_subtasks_count=Count('task__subtask'),
      task_subtasks_count_distinct=Count('task__subtask', distinct=True),
 ).annotate(
      tasks_duration=F('temp_tasks_duration')*F('tasks_count_distinct')/F('tasks_count'),
      subtasks_duration=F('temp_subtasks_duration')*F('subtasks_count_distinct')/F('subtasks_count'),
 )

Update: I found that you need to use Subquery. In the following solution, firstly you filter tasks for related to the outerref (OuterRef references to the outer query, so the tasks are filtered for each Project), then you group the tasks by 'project', so that the Sum applies on all the tasks of each projects and returns just one result if any task exists for the project (you have filtered by 'project' and then grouped by that same field; That's why just one group can be there.) or None otherwise. The result would be None if the project has no task, that means we can not use [0] to select the calculated sum.

from django.db.models import Subquery, OuterRef
Projects.objects.annotate(
    tasks_duration=Subquery(
        Task.objects.filter(
            project=OuterRef('pk')
        ).values(
            'project'
        ).annotate(
            the_sum=Sum('task__duration'),
        ).values('the_sum')[:1]
    ),
    subtasks_duration=Sum('task__subtask__duration')
)

Running this code will send just one query to the database, so the performance is great.

Happy Ahmad
  • 1,072
  • 2
  • 14
  • 33
  • Thank you for coming back here to update your answer. It looks OK, and I would like to select it as the answer, but I am not able to test it at this moment. Do you have a test setup with a few sample projects -> tasks -> subtasks that shows the correct tallying up of all times? Ideally also the SQL that Django generates would be great to see in the answer, then other readers know what to expect. – Charl Botha Jul 27 '18 at 08:42
  • You are welcome dear @CharlBotha. Yes, I have this in my laptop, but I'm not sure how can I present it to you. Is there something like Fiddle for python? – Happy Ahmad Jul 29 '18 at 19:31
  • Couldn't you check my answer yet? @CharlBotha – Happy Ahmad Oct 26 '18 at 15:24
  • 1
    @Ahmad - I've been struggling on a similar problem for several hours and this just helped me SO much. Thank you. – Andrew Oct 31 '18 at 15:25
1

I get this error as well. Exact same code. It works if I do the aggregation separately, but once I try to get both sums at the same time, one of them gets a factor 2 higher, and the other a factor 3.

I have no idea why Django behaves this way. I have filed a bug report here: https://code.djangoproject.com/ticket/19011 You might be interested in following it as well.

tBuLi
  • 2,295
  • 2
  • 16
  • 16