4

I'm bumping up against an issue with a series of grouped annotations on a Django QuerySet which is being filtered by a ManyToMany. Since I've been staring at my screen for too long, lets imagine a series of models about Wine tastings:

class Event(model.Model):
     # Some Fields

class Wine(models.Model):
     # Some Fields

class Tasting(models.Model):
    event = models.ManyToManyField(Event)
    wine = models.ForeignKey(Wine)
    score = models.IntegerField()

I want to get some aggregations on the data by wine, optionally filtering on certain events. With this sample Tasting data (which is what I'll be running the aggregation on):

| wine_id | score | event_ids |
| ------- | ----- | --------- |
| 1       | 50    | [1]       |
| 1       | 50    | [1]       |
| 1       | 50    | [1, 2]    |
| 2       | 100   | [1, 2]    |
| 2       | 150   | [1, 2]    |
| 3       | 75    | [1]       |

The expected output, for the above data is:

[
    {'wine_id': 1, 'total_scores': 150, 'average_scores': 50},
    {'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
    {'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]

Attempt 1

Just some regular values and annotation

Tasting.objects.filter(
   event__in=Event.objects.filter(id__in=[1,2])
).distinct().values('wine_id').annotate(
   total_scores=Sum('score'),
   average_scores=Avg('scores'),
)

Which outputs:

[
    {'wine_id': 1, 'total_scores': 200, 'average_scores': 50},  # Total score too high
    {'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
    {'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]

Hrm, so it looks like I'm butting up against that same issue that crops up with multiple annotations -- due to the join when filtering the events, one of the wine_1 rows is counted twice: once for each event.

Attempt 2

So, looking at a bunch of suggestions from that Django issue (such as this answer, I figured I could tackle the problem with Subqueries, which led me to this beast:

total_subquery = Subquery(Tasting.objects.filter(wine_id=OuterRef('wine_id')).annotate(
   total_scores=Sum('score'),
).values('total_scores'))

average_subquery = Subquery(Tasting.objects.filter(wine_id=OuterRef('wine_id')).annotate(
   average_scores=Avg('scores'),
).values('average_scores'))

Tasting.objects.filter(
   event__in=Event.objects.filter(id__in=[1,2])
).distinct().values('wine_id').annotate(
   total_scores=total_subquery,
   average_scores=average_subquery,
)

So, initially, this looked correct:

[
    {'wine_id': 1, 'total_scores': 150, 'average_scores': 50},
    {'wine_id': 2, 'total_scores': 250, 'average_scores': 125},
    {'wine_id': 3, 'total_scores': 75, 'average_scores': 75},
]

Huzzah! BUT, what if we change the filter to only include Event 2:

Tasting.objects.filter(
   event__in=Event.objects.filter(id__in=[2])
).distinct().values('wine_id').annotate(
   total_scores=total_subquery,
   average_scores=average_subquery,
)

In this case, I still get back the data for ALL events. This makes intuitive sense, since the subqueries don't have any knowledge of the outer filter. However, if I change the OuterRef values on the subquery (to something like filter(pk=OuterRef('pk'))), then the proper grouping on the subqueries falls apart. If I re-add the event filtering at the subquery level, then we get the same duplicate row issue that we had in our first attempt.

I can get the proper values by simple fetching all the data and then doing the aggregation in Python, but this has serious performance costs for larger datasets. Is there a way to do this aggregation entirely through the ORM?

MBrizzle
  • 1,783
  • 1
  • 18
  • 31
  • Having the event being M2M doesn't make sense to me. – James Lin May 31 '20 at 09:03
  • First thought on why it may output different 'total_score' - same Tasting may be connected with multiple Events (ManyToMany relataion suggests this is possible) - thus different sum. Can you confirm or provide sample database table structure? – Oleg Russkin May 31 '20 at 11:33
  • @JamesLin The models in the question are just examples: the real data I'm dealing with is domain-specific and wouldn't make much sense removed from the context, so I made up some other models that mapped on to the structure. I also could have called the models `A`, `B`, and `C`, but I figured this was more readable. The model structure is not under consideration here. The question is that **given this database structure**, is it possible to do this aggregation in a single query? – MBrizzle Jun 01 '20 at 17:53
  • @OlegRusskin yes, you can see the sample data table I provided that the third tasting for Wine 1 is connected to two events (1 and 2). If you're not familiar with the Django ORM, the ManyToMany field will create an intermediary table called (roughly) "tastings_wine_through" that will have two columns: `wine_id` and `tasting_id`. You can extrapolate what the values would be based on my sample data above. I agree that it's the multiple linked events that are causing the duplicate sum, but the question is whether it's possible to get the proper annotations regardless of the number of events. – MBrizzle Jun 01 '20 at 18:00

1 Answers1

3

hope you're not too frustrated :)

What you need to do in this case is avoiding the join of the Events table to stop the double-count madness.

Wine.objects.filter(
    tasting__in=Tasting.event.through.objects.filter(event_id__in=[1, 2]).values('tasting_id')
).values('id').annotate(
    total_score=Sum('tasting__score'),
    average_scores=Avg('tasting__score')
)
[
{'id': 1, 'total_score': 150, 'average_scores': 50.0}, 
{'id': 2, 'total_score': 250, 'average_scores': 125.0}, 
{'id': 3, 'total_score': 75, 'average_scores': 75.0}
]
ACimander
  • 1,852
  • 13
  • 17