I have three related tables:
- Modules
+-------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| module_id | int(11) | NO | PRI | NULL | |
+-------------+-------------+------+-----+---------+-------+
- Events
+------------------+--------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+--------------------------+------+-----+---------+----------------+
| event_id | int(11) | NO | PRI | NULL | auto_increment |
| event_time | datetime(4) | NO | | NULL | |
| module_id | int(11) | NO | MUL | NULL | |
| file_id | int(11) | YES | MUL | NULL | |
+------------------+--------------------------+------+-----+---------+----------------+
- Files
+--------------+-----------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+-----------------------------+------+-----+---------+----------------+
| file_id | int(11) | NO | PRI | NULL | auto_increment |
| path | varchar(512) | NO | UNI | NULL | |
+--------------+-----------------------------+------+-----+---------+----------------+
So, there are Modules, Events and Files. (Unused fields are trimmed from the tables to simplify).
Target: I want to get the latest event that happened on each module and it's file path.
What I tried: So, for this, at first I created a naive implementation on Django using subquery:
last_event_subquery = Event.objects.filter(
module_id__module_id=OuterRef('module__id')
).order_by('-event_time', '-event_id')
modules = Module.objects.all().annotate(
last_event_path=Subquery(last_event_subquery.values('file_id__path')[:1])
).annotate(
last_event_id=Subquery(last_event_subquery.values('event_id')[:1])
).annotate(
last_event_datetime=Subquery(last_event_subquery.values('event_time')[:1])
)
But, I discovered that running this over 1 million records in the events table is blatantly slow. Of course, there are several indexes over there to optimize all the thing, but even with that I couldn't find a combination of indexes that takes less than 5 seconds to run, which is too much imo. Then, I saw the reason, the equivalent SQL query is too silly:
SELECT `module`.`module_id`,
(SELECT U2.`path` FROM `events` U0 LEFT OUTER JOIN `files` U2 ON (U0.`file_id` = U2.`file_id`)
WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC LIMIT 1)
AS `last_event_path`,
(SELECT U0.`event_id` FROM `events` U0
WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC LIMIT 1)
AS `last_event_id`,
(SELECT U0.`event_time` FROM `events` U0
WHERE U0.`module_id` = (`modules`.`module_id`) ORDER BY U0.`event_time` DESC, U0.`event_id` DESC LIMIT 1)
AS `last_event_time` FROM `events`
As you can see, it's repeating the subquery three times.
So, I decided to give it a try at the best I could do in SQL, and I struggled to get the following working:
SELECT module.module_id,
events.event_id,
events.event_time,
files.path
FROM modules INNER JOIN events ON events.event_id =
(SELECT events.event_id FROM events
WHERE modules.module_id = events.module_id
ORDER BY events.event_time DESC, events.event_id DESC LIMIT 1)
INNER JOIN files ON files.file_id = events.file_id;
This runs in 0.001s. So, the problem now is that I can't get this done in the Django ORM language. Of course I could just place the raw SQL query and I would be done, but how would I live with such a disgrace?
I have investigated throughout the Django docs, struggled all over the stackoverflow questions, but I couldn't find the answer. The nearest I got was this, but the problem is that I can't get that limited to one result per module.
I have also tried the FilteredRelation, but couldn't get an appropriate filter. I can't use select_related() neither, because it's a reverse relation with a ForeignKey. I can't either use distinct() with a column field, because I'm using MySQL (more specifically, MariaDB version 10.3).
Do you have any advice on how to approach this?
Thank you!