1

I have a model for which I want to perform a group-by on two values and calculate the percentages of each value per outer grouping.

Currently I just make a query to get all the rows and put them into a pandas dataframe and perform something similar to the answer here. Although this works I'm sure it would be more efficient if I could make the query return the information I require directly.

I am currently running Django 2.0.5 with a backend DB on PostgreSQL 9.6.8

I think window functions could be the solution as indicated here but I cannot construct a successful combination of annotate and values to give me the desired output.

Another possible solution could be rollup introduced in PostgreSQL 9.5 if I can find a way to get the summary row as a set of extra columns for each row? But I also think it's not yet supported by Django.

Model:

class ModelA(models.Model):
    grouper1 = models.CharField()
    grouper2 = models.CharField()
    metric1 = models.IntegerField()

All rows:

grouper1 | grouper2 | metric1
---------+----------+---------
   A     |    C     |    2   
   A     |    C     |    2   
   A     |    C     |    2   
   A     |    D     |    4   
   A     |    D     |    4   
   A     |    D     |    4   
   B     |    C     |    5   
   B     |    C     |    5   
   B     |    C     |    5   
   B     |    D     |    6   
   B     |    D     |    4   
   B     |    D     |    5   

Desired output:

grouper1 | grouper2 | sum(metric1) | Percentage 
---------+----------+--------------+-----------
   A     |    C     |    6         |    40
   A     |    D     |    12        |    60
   B     |    C     |    15        |    50
   B     |    D     |    15        |    50

I got close to what I expected with

ModelA.objects.all(
    ).values(
        'grouper1', 
        'grouper2'
    ).annotate(
        SumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1'), F('grouper2')]), 
        GroupSumMetric1=Window(expression=Sum('metric1'), partition_by=[F('grouper1')])
)

However this returns a row for every original row in the database like so:

grouper1 | grouper2 | sum(metric1) | Percentage 
---------+----------+--------------+-----------
   A     |    C     |    6         |    40
   A     |    C     |    6         |    40
   A     |    C     |    6         |    40
   A     |    D     |    12        |    60
   A     |    D     |    12        |    60
   A     |    D     |    12        |    60
   B     |    C     |    15        |    50
   B     |    C     |    15        |    50
   B     |    C     |    15        |    50
   B     |    C     |    15        |    50
   B     |    C     |    15        |    50
   B     |    D     |    15        |    50 
PyPingu
  • 1,697
  • 1
  • 8
  • 21
  • 1
    did you try `.distinct()`? like discribed here - https://stackoverflow.com/a/33104968/8495108 – Chiefir May 03 '18 at 10:10
  • The obvious solution is to use `.distinct()`, indeed it gives the desired output, however it was my understanding that performing a `.values()` call followed by `annotate` or `aggregate` should perform a groupby? Also how efficient is `distinct`? If the number of metrics being calculated was 10-20 rather than 1 then will it be much slower (effectively calling `distinct` across 20 columns)? – PyPingu May 03 '18 at 10:31
  • I can't tell you how exactly fast this is, but as I understand you correctly - u are asking about `A | C | 6 | 40 | and 20+ columns` case? But `.distinct()` works on rows, not columns, so you might have any amount of metrics, as you wish. – Chiefir May 03 '18 at 10:36
  • It works on rows yes, but obviously the more columns in each row (which are required to be distinct), then the more computational overhead I assume? – PyPingu May 03 '18 at 10:39
  • 1
    yep, but `.annotated()` values - are not values from the database :) these are values, calculated by Python and "glued" to the Queryset. or you can try provide additional arguments `.distinct('grouper1', 'grouper2')`, but I dont know it will have sense here. And more to that - `distinct` runtime will be *much* faster, than calculating 20+ additional expressions. – Chiefir May 03 '18 at 10:44
  • did that solve your problem? – Chiefir May 03 '18 at 11:53
  • Yes `distinct()` provides what I want and performance doesn't seem to too impacted by an increase in columns - as you stated. I would dispute that `annotated` values are generated in python though, they are generated by SQL which is itself generated by the python. – PyPingu May 03 '18 at 14:58
  • as I understand, `annotate` by itself does not send any additional SQL - this may be possible if u use some expressions, like `F` there. But if u make something like `.annotate(site = "google.com") - this might not send any additional SQL. P.S. I will post the answer, accept it please. – Chiefir May 03 '18 at 15:02

1 Answers1

1

In this situation .distinct() might help.
More information is here.

Chiefir
  • 2,561
  • 1
  • 27
  • 46