0

I have a model like this (simplified):

class Events(models.Model):
    reference_date = models.DateField()
    event = models.IntegerField(db_index=True)
    created_at = models.DateTimeField()
    updated_at = models.DateTimeField()

    class Meta:
        unique_together = (('reference_date', 'event'),)

I can have the same event with multiple reference dates, but I only have one event per reference_date. The table data looks something like this:

id reference_date event created_at updated_at
1 2022-01-01 12345 2022-03-05 18:18:03 2022-03-06 18:12:09
2 2022-01-02 12345 2022-03-08 08:05:11 2022-03-08 08:05:55
3 2022-01-08 12345 2022-06-15 18:18:12 2022-06-16 02:23:11
4 2022-01-01 98765 2022-01-11 07:55:25 2022-01-13 08:45:12
5 2022-01-02 98765 2022-06-22 10:25:08 2022-07-05 18:55:08
6 2022-01-09 45678 2022-02-19 12:55:07 2022-04-16 12:21:05
7 2022-01-10 45678 2022-03-05 11:23:45 2022-03-05 18:55:03

I need the latest record for each event. But I need all the event attributes, not just the max(reference_date)

I'm looking for this result:

[
{'id': 3, 'event': 12345, 'reference_date': '2022-01-08', 'created_at': '2022-06-15 18:18:12', 'updated_at': '2022-06-16 02:23:11'},
{'id': 5, 'event': 98765, 'reference_date': '2022-01-02', 'created_at': '2022-06-22 10:25:08', 'updated_at': '2022-07-05 18:55:08'},
{'id': 7, 'event': 45678, 'reference_date': '2022-01-10', 'created_at': '2022-03-05 11:23:45', 'updated_at': '2022-03-05 18:55:03'}
]

From a 'sql-perspective' I could get the results in multiple ways: SUBQUERIES, ROW_NUMBER, CORRELATED SUBQUERY etc. In this particular case for clarity reasons I prefer to use a join with itself using an aggregate inside the subquery.

If I'd write a raw query, I'd would do like this:

SELECT
 e.*
FROM events as e
INNER JOIN (
        SELECT
         event,
         max(reference_date) as reference_date
        FROM events
        GROUP BY event) AS b
ON b.reference_date = e.reference_date AND b.event = e.event

I think this query is very straightforward and has a good performance for the data volume in this table (hundreds of thousands of records)

I looked for several ways to build this type of query (Subquery, PrefetchRelated etc) but I couldn't find any proper way to translate this query into Django ORM syntax.

Is there a way using the 'django-orm way' without a convoluted ORM solution with the same reasonable performance?

ps: I need this query (or similar ones) for different database engines (PostgreSql, MySQL, MSQLServer ..)

Sunderam Dubey
  • 1
  • 11
  • 20
  • 40
Henrique
  • 1
  • 2

1 Answers1

0

I am not 100% certain this will work, mostly because I mysel haven't had a need (yet) to use this with Django's ORM, but something like this might work. You need to import Max from django.db.models.

I will post this and read up on the docs about .aggregate() - read up and still not an expert :).

from django.db.models import Max
#Events.objects.all() -> get all the objects from Events
#.group_by("event") tells them to be grouped by the "event" column
#.aggregate(Max("reference_data")) is what I am uncertain about...
Events.objects.all().group_by("event").aggregate(Max("reference_date"))

There is also the option to write your raw statement for each db engine and use the .raw(query) method on your model to query the database. Since you need support for multiple databases, you could also use this to determine what database you are using to run the query.

I also came across this article that influenced the suggestion below:

Events.objects.values('event').annotate(Max('reference_date'))

Lastly, A SO Post with relatively the same question mentions using select_related() to achieve JOINs.

Shmack
  • 1,933
  • 2
  • 18
  • 23
  • Your suggestion only returns the last `reference_date` for each `event`: `['event': 12345, 'reference_date': '2022-01-08'}]`. I need to use this result queryset within a subquery to return all attributes that match those tuples `(event, reference_date)`. I only mentioned that i'm using more than one database to avoid 'PostgreSql only' solutions, because I need those type of queries in different projects with different databases – Henrique Sep 19 '22 at 18:34
  • But its returning a query set of more than 1 event, correct? And I am assuming you mean the first suggestion? – Shmack Sep 19 '22 at 18:39
  • Ah, I see... I think I know part of the problem... Do `Events.objects.annotate(Max('reference_date'))` instead. – Shmack Sep 19 '22 at 18:49
  • I'm sorry. This one: `Events.objects.all().group_by("event").aggregate(Max("reference_date"))` doesn't work: `'QuerySet' object has no attribute 'group_by'` I was talking about this one: `Events.objects.values('event').annotate(Max('reference_date'))` It returns the maximum date (last `reference_date`) for each `event`. It would be the same raw query as : `SELECT event, max(reference_date) as reference_date FROM events GROUP BY event)`. But I need not only the `date` and the `event`, but also the entire set of attributes of this record `id, created_at, updated_at` – Henrique Sep 19 '22 at 18:55
  • One of the links that you provided answers the aggregation part of the problem. This part is relatively easy to solve in Django. When I 'annotate' a queryset and use an aggregation function (Max, Min, etc) the ORM translate this directly to a grouped set. The other link is about the ORM JOIN, but doesn't really touch the join with subquery problem – Henrique Sep 19 '22 at 19:06
  • 1st response comment - I'm with you. I think .values() selects the fields, so run it without `.values` or use `.all()`, but with either situation use .annotate, so `Events.objects.all().annotate(Max('reference_date'))` or `Events.objects.annotate(Max('reference_date'))` and (I think) a `.group_by("reference_data")`. 2nd response comment - you could write the subquery in django ORM then run an additional query, but I was uncertain as to how that would affect speed - so I dodged that suggestion in my answer. Thoughts from above? – Shmack Sep 19 '22 at 19:13
  • In addition to above: Correction - values() returns a dictionary of the objects and attributes as opposed to objects, however, passing it args will return those fields (as a dictionary). – Shmack Sep 19 '22 at 19:18
  • [1] The first part just groups the events per max reference_date with or without `all()` or `values()`. They return a queryset with the tuples (event , reference_date__max), but the other attributes stay unreachable. [2] My solution today is to use two queries. One to get those pairs (event and reference_date__max) and another query with a big filtered OR and ANDs, similar to this solution (https://stackoverflow.com/questions/20177749/django-filter-queryset-on-tuples-of-values-for-multiple-columns) but I'm trying to find a more "elegant" solution with the Django ORM – Henrique Sep 19 '22 at 19:50
  • Well, I tried. I did a search in the django orm docs for "inner" and "join" and (though I'm not certain if its the same join), they are making 2 queries too in their examples. – Shmack Sep 19 '22 at 20:47