I'm bumping up against an issue with a series of grouped annotations on a Django QuerySet which is being filtered by a ManyToMany. Since I've been staring at my screen for too long, lets imagine a series of models about Wine tastings:
class Event(model.Model):
# Some Fields
class Wine(models.Model):
# Some Fields
class Tasting(models.Model):
event = models.ManyToManyField(Event)
wine = models.ForeignKey(Wine)
score = models.IntegerField()
I want to get some aggregations on the data by wine, optionally filtering on certain events. With this sample Tasting data (which is what I'll be running the aggregation on):
| wine_id | score | event_ids |
| ------- | ----- | --------- |
| 1 | 50 | [1] |
| 1 | 50 | [1] |
| 1 | 50 | [1, 2] |
| 2 | 100 | [1, 2] |
| 2 | 150 | [1, 2] |
| 3 | 75 | [1] |
The expected output, for the above data is:
[
{'wine_id': 1, 'total_scores': 150, 'average_scores': 50},
{'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
{'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]
Attempt 1
Just some regular values
and annotation
Tasting.objects.filter(
event__in=Event.objects.filter(id__in=[1,2])
).distinct().values('wine_id').annotate(
total_scores=Sum('score'),
average_scores=Avg('scores'),
)
Which outputs:
[
{'wine_id': 1, 'total_scores': 200, 'average_scores': 50}, # Total score too high
{'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
{'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]
Hrm, so it looks like I'm butting up against that same issue that crops up with multiple annotations -- due to the join when filtering the events, one of the wine_1
rows is counted twice: once for each event.
Attempt 2
So, looking at a bunch of suggestions from that Django issue (such as this answer, I figured I could tackle the problem with Subqueries, which led me to this beast:
total_subquery = Subquery(Tasting.objects.filter(wine_id=OuterRef('wine_id')).annotate(
total_scores=Sum('score'),
).values('total_scores'))
average_subquery = Subquery(Tasting.objects.filter(wine_id=OuterRef('wine_id')).annotate(
average_scores=Avg('scores'),
).values('average_scores'))
Tasting.objects.filter(
event__in=Event.objects.filter(id__in=[1,2])
).distinct().values('wine_id').annotate(
total_scores=total_subquery,
average_scores=average_subquery,
)
So, initially, this looked correct:
[
{'wine_id': 1, 'total_scores': 150, 'average_scores': 50},
{'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
{'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]
Huzzah! BUT, what if we change the filter to only include Event 2:
Tasting.objects.filter(
event__in=Event.objects.filter(id__in=[2])
).distinct().values('wine_id').annotate(
total_scores=total_subquery,
average_scores=average_subquery,
)
In this case, I still get back the data for ALL events. This makes intuitive sense, since the subqueries don't have any knowledge of the outer filter. However, if I change the OuterRef
values on the subquery (to something like filter(pk=OuterRef('pk'))
), then the proper grouping on the subqueries falls apart. If I re-add the event filtering at the subquery level, then we get the same duplicate row issue that we had in our first attempt.
I can get the proper values by simple fetching all the data and then doing the aggregation in Python, but this has serious performance costs for larger datasets. Is there a way to do this aggregation entirely through the ORM?