5

I have three related tables:

  • Modules
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| module_id   | int(11)     | NO   | PRI | NULL    |       |
+-------------+-------------+------+-----+---------+-------+
  • Events
+------------------+--------------------------+------+-----+---------+----------------+
| Field            | Type                     | Null | Key | Default | Extra          |
+------------------+--------------------------+------+-----+---------+----------------+
| event_id         | int(11)                  | NO   | PRI | NULL    | auto_increment |
| event_time       | datetime(4)              | NO   |     | NULL    |                |
| module_id        | int(11)                  | NO   | MUL | NULL    |                |
| file_id          | int(11)                  | YES  | MUL | NULL    |                |
+------------------+--------------------------+------+-----+---------+----------------+
  • Files
+--------------+-----------------------------+------+-----+---------+----------------+
| Field        | Type                        | Null | Key | Default | Extra          |
+--------------+-----------------------------+------+-----+---------+----------------+
| file_id      | int(11)                     | NO   | PRI | NULL    | auto_increment |
| path         | varchar(512)                | NO   | UNI | NULL    |                |
+--------------+-----------------------------+------+-----+---------+----------------+

So, there are Modules, Events and Files. (Unused fields are trimmed from the tables to simplify).

Target: I want to get the latest event that happened on each module and it's file path.

What I tried: So, for this, at first I created a naive implementation on Django using subquery:

last_event_subquery = Event.objects.filter(
        module_id__module_id=OuterRef('module__id')
    ).order_by('-event_time', '-event_id')

modules = Module.objects.all().annotate(
        last_event_path=Subquery(last_event_subquery.values('file_id__path')[:1])
    ).annotate(
        last_event_id=Subquery(last_event_subquery.values('event_id')[:1])
    ).annotate(
        last_event_datetime=Subquery(last_event_subquery.values('event_time')[:1])
    )

But, I discovered that running this over 1 million records in the events table is blatantly slow. Of course, there are several indexes over there to optimize all the thing, but even with that I couldn't find a combination of indexes that takes less than 5 seconds to run, which is too much imo. Then, I saw the reason, the equivalent SQL query is too silly:

SELECT `module`.`module_id`,
       (SELECT U2.`path` FROM `events` U0 LEFT OUTER JOIN `files` U2 ON (U0.`file_id` = U2.`file_id`)
       WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC  LIMIT 1)
       AS `last_event_path`,
       (SELECT U0.`event_id` FROM `events` U0
       WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC  LIMIT 1)
       AS `last_event_id`,
       (SELECT U0.`event_time` FROM `events` U0
       WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC  LIMIT 1)
       AS `last_event_time` FROM `events`

As you can see, it's repeating the subquery three times.

So, I decided to give it a try at the best I could do in SQL, and I struggled to get the following working:

SELECT module.module_id,
       events.event_id,
       events.event_time,
       files.path
       FROM modules INNER JOIN events ON events.event_id =
           (SELECT events.event_id FROM events
            WHERE modules.module_id = events.module_id
                ORDER BY events.event_time DESC, events.event_id DESC LIMIT 1)
       INNER JOIN files ON files.file_id = events.file_id;

This runs in 0.001s. So, the problem now is that I can't get this done in the Django ORM language. Of course I could just place the raw SQL query and I would be done, but how would I live with such a disgrace?

I have investigated throughout the Django docs, struggled all over the stackoverflow questions, but I couldn't find the answer. The nearest I got was this, but the problem is that I can't get that limited to one result per module.

I have also tried the FilteredRelation, but couldn't get an appropriate filter. I can't use select_related() neither, because it's a reverse relation with a ForeignKey. I can't either use distinct() with a column field, because I'm using MySQL (more specifically, MariaDB version 10.3).

Do you have any advice on how to approach this?

Thank you!

1 Answers1

0

So, I found the answer my self :)

It generates the exact same SQL I wanted, excepted for the order of the clauses, which doesn't affect the query at all. The key was to use the .filter() with relation ship. Using the subquery inside the filter() (which I didn't know was possible) made the trick. I was inspired by this answer.

last_event_subquery = Event.objects.filter(
    module_id__module_id=OuterRef('module_id')
).order_by('-event_time', '-event_id')
modules = Module.objects.filter(
    event__event_id=Subquery(
        last_event_subquery.values('module_id')[:1])
).values('id', 'event__event_id', 'event__event_time', 'event__file__path')

Which yields the following SQL:

SELECT `modules`.`module_id`,
       `events`.`event_id`,
       `events`.`event_time`,
       `files`.`path`
       FROM `modules`
       INNER JOIN `events` ON (`modules`.`module_id` = `events`.`module_id`)
       LEFT OUTER JOIN `files` ON (`events`.`file_id` = `files`.`file_id`)
       WHERE `events`.`event_id` =
            (SELECT U0.`event_id` FROM `events` U0
            WHERE U0.`module_id` = (`modules`.`module_id`)
                 ORDER BY U0.`event_time` DESC, U0.`event_id` DESC  LIMIT 1)

I hope this is useful for anyone else.