Here is a quick performance comparison for the queries mention in this post.
Current setup :
The table core_message
has 10,904,283 rows and there is 60,740 rows in test_boats
(or 60,740 distinct mmsi in core_message
).
And I'm using PostgreSQL 11.5
Query using index-only scan :
1) using DISTINCT ON
:
SELECT DISTINCT ON (mmsi) mmsi
FROM core_message;
2) using RECURSIVE
with LATERAL
:
WITH RECURSIVE cte AS (
(
SELECT mmsi
FROM core_message
ORDER BY mmsi
LIMIT 1
)
UNION ALL
SELECT m.*
FROM cte c
CROSS JOIN LATERAL (
SELECT mmsi
FROM core_message
WHERE mmsi > c.mmsi
ORDER BY mmsi
LIMIT 1
) m
)
TABLE cte;
3) Using an extra table with LATERAL
:
SELECT a.mmsi
FROM test_boats a
CROSS JOIN LATERAL(
SELECT b.time
FROM core_message b
WHERE a.mmsi = b.mmsi
ORDER BY b.time DESC
LIMIT 1
) b;
Query not using index-only scan :
4) using DISTINCT ON
with mmsi,time DESC
INDEX
:
SELECT DISTINCT ON (mmsi) *
FROM core_message
ORDER BY mmsi, time desc;
5) using DISTINCT ON
with backward mmsi,time
UNIQUE CONSTRAINT
:
SELECT DISTINCT ON (mmsi) *
FROM core_message
ORDER BY mmsi desc, time desc;
6) using RECURSIVE
with LATERAL
and mmsi,time DESC
INDEX
:
WITH RECURSIVE cte AS (
(
SELECT *
FROM core_message
ORDER BY mmsi , time DESC
LIMIT 1
)
UNION ALL
SELECT m.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *
FROM core_message
WHERE mmsi > c.mmsi
ORDER BY mmsi , time DESC
LIMIT 1
) m
)
TABLE cte;
7) using RECURSIVE
with LATERAL
and backward mmsi,time
UNIQUE CONSTRAINT
:
WITH RECURSIVE cte AS (
(
SELECT *
FROM core_message
ORDER BY mmsi DESC , time DESC
LIMIT 1
)
UNION ALL
SELECT m.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *
FROM core_message
WHERE mmsi < c.mmsi
ORDER BY mmsi DESC , time DESC
LIMIT 1
) m
)
TABLE cte;
8) Using an extra table with LATERAL
:
SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL(
SELECT b.*
FROM core_message b
WHERE a.mmsi = b.mmsi
ORDER BY b.time DESC
LIMIT 1
) b;
Using a dedicated table for the last message:
9) Here is my initial solution, using a distinct table with only the last message. This table is populated as new messages arrive but could also be created like so :
CREATE TABLE core_shipinfos AS (
WITH RECURSIVE cte AS (
(
SELECT *
FROM core_message
ORDER BY mmsi DESC , time DESC
LIMIT 1
)
UNION ALL
SELECT m.*
FROM cte c
CROSS JOIN LATERAL (
SELECT *
FROM core_message
WHERE mmsi < c.mmsi
ORDER BY mmsi DESC , time DESC
LIMIT 1
) m
)
TABLE cte);
Then the request to get the latest message is as simple as that :
SELECT * FROM core_shipinfos;
Results :
Average of multiple query (around 5 for the fast one):
1) 9146 ms
2) 728 ms
3) 498 ms
4) 51488 ms
5) 54764 ms
6) 729 ms
7) 778 ms
8) 516 ms
9) 15 ms
Conclusion:
I won't comment on the dedicated table solution, and will keep that for the end.
The additional table (test_boats
) solution is definitely the winner here but the RECURSIVE
solution is also pretty efficient.
There is a huge gap in performance for the DISTINCT ON
using index-only scan and the one not using it but, the performance gain is rather small for the other efficient query.
This makes sense as the major improvement those queries bring is the fact that they don't need to loop over the whole core_message
table but only on a subset of the unique mmsi
that is significantly smaller (60K+) in comparison to the core_message
table size (10M+)
As an additional note, there does not seem to be significant improvement in performance for the queries using the UNIQUE CONSTRAINT
if I drop the mmsi,time DESC
INDEX
. But dropping that index will of course save me some space (this index currently take 328MB)
About the dedicated table solution:
Each messages stored in the core_message
table carries both positional informations (position, speed, heading,etc.) AND ship informations (name, callsign, dimensions, etc.), as well as ship identifier (mmsi).
To give a bit more background on what I'm actually trying to do : I'm implementing a backend to store messages emitted by ships via the AIS protocol.
As such, every unique mmsi I got, I got it via this protocol. It is not a pre-defined list. It keeps adding new MMSI until I got every ships in the world using AIS.
In that context, a dedicated table with ship informations as last message received makes sense.
I could avoid using such a table as we've seen with the RECURSIVE
solution, but... a dedicated table is still 50x faster than this RECURSIVE
solution.
That dedicated table is in fact similar to the test_boat
table, with more information than just the mmsi
field. As it is, having a table with mmsi
only field or a table with every last informations of the core_message
table add the same complexity to my application.
In the end, I think I will go for this dedicated table. It will gives me unbeatable speed and I'll still have the possibility to use the LATERAL
trick on core_message
, which will gives me more flexibility.