3

Background

Suppose we have a set of questions, and a set of students that answered these questions. The answers have been reviewed, and scores have been assigned, on some unknown range.

Now, we need to normalize the scores with respect to the extreme values within each question. For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).

Then, for each student, we would like to know the mean of the normalized scores for all questions combined.

Minimal example

Here's a very naive minimal implementation, just to illustrate what we would like to achieve:

class Question(models.Model):
    pass


class Student(models.Model):
    def mean_normalized_score(self):
        normalized_scores = []
        for score in self.score_set.all():
            normalized_scores.append(score.normalized_value())
        return mean(normalized_scores) if normalized_scores else None


class Score(models.Model):
    student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
    question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
    value = models.FloatField()

    def normalized_value(self):
        limits = Score.objects.filter(question=self.question).aggregate(
            min=models.Min('value'), max=models.Max('value'))
        return (self.value - limits['min']) / (limits['max'] - limits['min'])

This works well, but it is quite inefficient in terms of database queries, etc.

Goal

Instead of the implementation above, I would prefer to offload the number-crunching on to the database.

What I've tried

Consider, for example, these two use cases:

  1. list the normalized_value for all Score objects
  2. list the mean_normalized_score for all Student objects

The first use case can be covered using window functions in a query, something like this:

w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
    normalized_value=(F('value') - w_min) / (w_max - w_min))

This works nicely, so the Score.normalized_value() method from the example is no longer needed.

Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.

The raw SQL could look something like this (for sqlite):

SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
    SELECT
        myapp_score.*,
        ((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
        AS normalized_value
    FROM myapp_score
    ) 
GROUP BY student_id

I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django's ORM.

I've tried building on the annotated_scores queryset described above, using Django's Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.

Probably the closest I've gotten is this:

subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))

But this is incorrect.

Could someone point me in the right direction, without resorting to raw queries?

djvg
  • 11,722
  • 5
  • 72
  • 103
  • Having a crack at this. It would probably help to calculate the mean from scores as a subquery, then grouping by students? `annotated_scores = Score.objects.filter(student=OuterRef('pk').annotate(normalized_value=(F('value') - w_min) / (w_max - w_min)).values('student').annotate(mean=Avg('normalized_value'))` and then from the Student, `Student.objects.annotate(mean_normalized_score=Subquery(annotated_scores.values('mean')[:1]))` – Brian Destura Jul 08 '21 at 07:34
  • @bdbd: Thanks for the attempt. :-) I think a closing bracket is missing from your `filter()`? Unfortunately, after fixing this, I get an `OperationalError: misuse of window function MIN()`. The same error arises e.g. if I try to `annotate()` my `annotated_scores`. – djvg Jul 08 '21 at 08:14
  • Just found some related questions, but no definite answer yet: https://stackoverflow.com/q/60864345, https://stackoverflow.com/q/38633076, https://stackoverflow.com/q/55638498 – djvg Jul 08 '21 at 08:50
  • related: https://stackoverflow.com/q/68289137 – djvg Feb 08 '23 at 11:12

2 Answers2

1

I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that's what is blocking the calculation of the mean of the normalized values. I've added comments on the lines to explain what I'm trying to do:

# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))

# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))

# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
    min=Subquery(min_subquery.values('min')[:1]), 
    max=Subquery(max_subquery.values('max')[:1]), 
    normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))

# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))

The resulting SQL is:

SELECT 
  "student"."id", 
  "student"."name", 
  (
    SELECT 
      AVG(
        (
          (
            V0."value" - (
              SELECT 
                MIN(U0."value") AS "min" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            )
          ) / (
            (
              SELECT 
                MAX(U0."value") AS "max" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            ) - (
              SELECT 
                MIN(U0."value") AS "min" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            )
          )
        )
      ) AS "mean" 
    FROM 
      "score" V0 
    WHERE 
      V0."student_id" = ("student"."id") 
    GROUP BY 
      V0."student_id" 
    LIMIT 
      1
  ) AS "mean" 
FROM 
  "student"
Brian Destura
  • 11,487
  • 3
  • 18
  • 34
  • 1
    Thanks! It looks like this workaround solves the immediate problem. I was hoping for a way to re-use the windowed queryset, as it is quite a bit easier on the mind, in my opinion. ;-) However, that does not appear to be an option at the moment: https://code.djangoproject.com/ticket/28333 – djvg Jul 09 '21 at 08:13
  • With a little refactoring, your min/max subqueries can be used as drop-in replacements for my min/max windows. I think this is probably the most convenient for now, so I'll accept your answer. Thanks again! :-) – djvg Jul 09 '21 at 12:49
1

As mentioned by @bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).

As a temporary workaround, I refactored @bdbd's excellent Subquery solution as follows.

class ScoreQuerySet(models.QuerySet):
    def annotate_normalized(self):
        w_min = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            min=Min('value')).values('min')[:1])
        w_max = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            max=Max('value')).values('max')[:1])
        return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))

    def aggregate_student_mean(self):
        return self.annotate_normalized().values('student_id').annotate(
            mean=Avg('normalized'))


class Score(models.Model):
    objects = ScoreQuerySet.as_manager()
    ...

Note: If necessary, we can add more Student lookups to the values() in aggregate_student_mean(), e.g. student__name. As long as we take care not to mess up the grouping.

Now, if it ever becomes possible to annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:

w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])

EDIT: Django 4.2 now supports filtering windowed querysets, but still does not allow annotations.

djvg
  • 11,722
  • 5
  • 72
  • 103