Getting top rows with the latest related row in joined table quickly

Question

There are two tables conversations and messages, I want to fetch conversations along with the content of their latest message.

conversations - id(PRIMARY KEY), name, created_at

messages - id, content, created_at, conversation_id

Currently we are running this query to get the required data

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM
    conversations
INNER JOIN messages m ON conversations.id = m.conversation_id
                     AND m.id = (
    SELECT
        id
    FROM
        messages _m
    WHERE
        m.conversation_id = _m.conversation_id
    ORDER BY
        created_at DESC
    LIMIT 1)
ORDER BY
    last_message_at DESC
LIMIT 15
OFFSET 0

The above query is returning the valid data but its performance decreases with the increasing number of rows. Is there any other way to write this query with increased performance? Attaching the fiddle for example.

http://sqlfiddle.com/#!17/2decb/2

Also tried the suggestions in one of the deleted answers:

SELECT DISTINCT ON (c.id)
       c.id,
       m.content AS last_message_content,
       m.created_at AS last_message_at
  FROM conversations AS c
 INNER JOIN messages AS m
    ON c.id = m.conversation_id 
 ORDER BY c.id, m.created_at DESC
 LIMIT 15 OFFSET 0

http://sqlfiddle.com/#!17/2decb/5

But the problem with this query is it doesn't sort by m.created_at. I want the resultset to be sorted by m.created_at DESC

If the query you tried doesn't order the results as desired, just wrap it in another query `SELECT * FROM () AS unsorted ORDER BY last_message_at DESC` — MatBailie, Dec 05 '21 at 17:30
do you want other columns from conversations table too or just the id? Since id is stored in the messages table too so you only need to lookup one table, you won't need a join. — Salman A, Dec 07 '21 at 10:28
BTW, avoid sqlfiddle.com. It was cool some years ago, but it's outdated and offline half the time now. I suggest dbfiddle.uk instead. — Erwin Brandstetter, Dec 12 '21 at 21:17

score 5 · Answer 1 · edited Dec 11 '21 at 14:42

Since no column other than id from conversations is selected in the result you can only query the messages table to reduce time (Query1). If there is a possibility to have any messages for which conversation_id is not available in conversations table and you don't want to select those then you can use second query.

Schema and insert statements:

CREATE TABLE conversations 
(
    id INT, 
    name VARCHAR(200), 
    created_at DATE
);

INSERT INTO conversations VALUES (1, 'CONV1', '1 DEC 2021');
INSERT INTO conversations VALUES (2, 'CONV2', '1 DEC 2021');

CREATE TABLE messages 
(
    id INT, 
    content VARCHAR(200), 
    created_at DATE, 
    conversation_id INT
);

INSERT INTO messages VALUES (1, 'TEST 3', '12 DEC 2021', 1);
INSERT INTO messages VALUES (2, 'TEST 2', '11 DEC 2021', 1);
INSERT INTO messages VALUES (3, 'TEST 1', '10 DEC 2021', 1);
INSERT INTO messages VALUES (4, 'TEST CONV2 1', '10 DEC 2021', 2);

Query:

 WITH Latest_Conversation_Messages AS
 ( 
     SELECT
         conversation_id, content, created_at, 
         ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
     FROM
         messages
 )
 SELECT
     conversation_id, content AS last_message_content,
     created_at AS last_message_at 
 FROM
     Latest_Conversation_Messages 
 WHERE
     rn = 1

Output:

conversation_id	last_message_content	last_message_at
1	TEST 3	2021-12-12
2	TEST CONV2 1	2021-12-10

Query2:

WITH Latest_Conversation_Messages AS
( 
    SELECT
        conversation_id, content, created_at , 
        ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
    FROM
        messages m
    WHERE
        EXISTS (SELECT 1 FROM conversations c 
                WHERE c.id = m.conversation_id)
)
SELECT 
    conversation_id, content AS last_message_content,
    created_at AS last_message_at 
FROM
    Latest_Conversation_Messages 
WHERE
    rn = 1

Output:

conversation_id	last_message_content	last_message_at
1	TEST 3	2021-12-12
2	TEST CONV2 1	2021-12-10

db<>fiddle here

score 3 · Answer 2 · answered Dec 08 '21 at 08:52

3

Using not exists sub-statement ,

select
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id 
and not exists (select 1 from messages n 
where m.conversation_id = n.conversation_id 
and m.created_at < n.created_at)

create an index on conversation_id of table messages would be good.

answered Dec 08 '21 at 08:52

ElapsedSoul

725
6
18

May be it is `>` in `not exists` sub-statement, you can test. It's 50% for wrong. But I check it in my mind,you can check it in database. – ElapsedSoul Dec 08 '21 at 08:55

dshelya · Answer 3 · 2021-12-10T01:44:00.253

Before measuring performance and comparing explains from different queries it usually worth to try and mimic your production setup first or you may (and almost certainly will) get a misleading EXPLAIN path (e.g. seq scan is used instead of index scan on small tables as sequential will be faster than random IO in such case)

I tried to tackle it in this way:

First - generate 200K conversations within the past 30 days

insert into conversations(id, name, created_at)
select 
    generate_series, 
    'CONVERSATION_'||generate_series, 
    NOW() - (random() * (interval '30 days')) 
from generate_series(1, 200000);

Second - generate 2M messages randomly distributed among 200K conversations and then also intentionally created 5 more "most recent" messages for conversation with ID=999, so that the conversation 999 has to always appear on top of query result.

insert into messages(id, content, conversation_id, created_at)
select msg.id, content, conversation_id, created_at + (random() * (interval '7 days')) from (
    select distinct
        generate_series as id, 
        'Message content ' || generate_series as content,
        1 + trunc(random() * 200000) as conversation_id
    from generate_series(1, 2000000)
) msg
join conversations c on c.id = conversation_id;

insert into messages(id, content, conversation_id, created_at)
select 
    generate_series as id, 
    'Message content ' || generate_series as content,
    999 as conversation_id,
    now() + interval '7 day' + (random() * (interval '7 days'))
from generate_series(2000001, 2000006);

And now you can try and compare (now with a little bit more confidence) those EXPLAINs to see which query works better.

Assuming you added the proposed index

CREATE INDEX idx1 ON messages(conversation_id, created_at desc)

Both @GoonerForLife and @asinkxcoswt answers are pretty good, though the result is moderate because of window function usage

with cost=250000 on average and 2 to 3 seconds execution time on my machine

@SalmanA and @ESG answers are twice as fast even though the lateral join will force query planner to choose sequential scan (this is inevitable as the join is on TRUE, so no index could be used)

with cost ~150000 on average and 1-1.5 seconds execution time on my machine

It may not be obvious at first, but the @ElapsedSoul's answers with NOT EXISTS is quite close to ideal (though it still needs a couple of tweaks):

(1) It lacks the order by and limit:

select 
    conversations.id, 
    m.content AS last_message_content, 
    m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id and not exists (
    select 1 from messages n 
    where m.conversation_id = n.conversation_id and m.created_at < n.created_at
) order by last_message_at desc
limit 15

And (2) Since there is date-to-date comparison inside the NOT EXISTS subquery - we have to add another index on massages table

CREATE INDEX ix2 ON messages(created_at desc);

After that we should get a decent performance gain. For example on my machine it resulted in 0.036ms execution time and 20.07 cost

score 2 · Answer 4 · answered Dec 12 '21 at 02:33

12 answers already. (!) And yet, here is number 13 with a query to outperform all of them.

All you need is an index on messages.created_at for this to be fast. Like:

CREATE INDEX ON messages (created_at DESC);

Works with ascending order, too, almost at the same speed

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT conversation_id, content, created_at, ARRAY[conversation_id] AS latest_ids
   FROM   messages
   ORDER  BY created_at DESC
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT m.conversation_id, m.content, m.created_at, c.latest_ids || m.conversation_id
      FROM   messages m
      WHERE  m.created_at <= c.created_at
      AND    m.conversation_id <> ALL (c.latest_ids)
      ORDER  BY m.created_at DESC
      LIMIT  1
      ) m
   -- WHERE  cardinality(c.latest_ids) < 15  -- not necessary
   )
SELECT conversation_id
     , content    AS last_message_content
     , created_at AS last_message_at
FROM   cte
LIMIT  15;

db<>fiddle here

Why? How?

Key is your small LIMIT, taking only the 15 "latest" conversations. (The ones with the latest messages, to be precise.) So very few. There are better query styles to get all or most conversations with their respective latest message.

Your table is big (you mentioned an "increasing number of rows") so the all-essential virtue must be to avoid a sequential scan over the big table - or even an index-scan over the whole table. That's what this query achieves.

The query emulates an index-skip scan with a recursive CTE. The non-recursive part takes the latest message, the recursive part takes the next-latest message from a different conversation. latest_ids keeps track of the selected conversations to avoid dupes.

Postgres will stop recursion as soon as enough rows have been selected to meet the outer LIMIT. So the added, but commented, break condition in the recursive term is not needed.

Unless there are too many messages in each conversation, so that a large amount of rows would have to be filtered (seems extremely unlikely), this should be as good as it gets.

See:

Oh, and we don't need to involve the table conversations at all. We get conversation_id from table messages and that's all you want.

score 1 · Accepted Answer · answered Dec 05 '21 at 17:32

Have you tried a lateral join instead?

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
INNER JOIN LATERAL (
  SELECT content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
  FETCH FIRST 1 ROW ONLY
) m ON TRUE
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0

asinkxcoswt · Answer 6 · 2021-12-07T12:01:34.833

Another version worth trying

SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM conversations cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
LIMIT 15 OFFSET 0;

Since we only need the first n (15) conversations, if the database optimizer fails to figure out this (need to test against the actual data to see), it will cause full table scan.

To help the optimizer, we can tell it to select first n rows from conversations before joining.

WITH cv as (
SELECT id
, created_at
from conversations
ORDER BY created_at DESC
LIMIT 15 OFFSET 0
)
SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
;

But with limitation that you will get the latest conversations instead of the latest messages. To fix this, you can add last_updated_at to the conversations.

Please note that an index on conversations.created_at (or last_updated_at if you want) is a critical factor here, please don't fotget to have one.

A `limit 15` is also needed – Salman A Dec 07 '21 at 11:17 — Salman A, Dec 07 '21 at 11:17
@SalmanA as you wish, ;p – asinkxcoswt Dec 07 '21 at 11:23 — asinkxcoswt, Dec 07 '21 at 11:23

GoonerForLife · Answer 7 · 2021-12-05T18:29:56.977

1

Try this one.

WITH cte AS
(
    SELECT
        conversations.id,
        m.content AS last_message_content,
        m.created_at AS last_message_at,
        MAX(M.created_at) OVER(PARTITION BY conversations.id) = M.created_at AS isLatest
    FROM conversations 
    INNER JOIN messages m 
      ON conversations.id = m.conversation_id
)
SELECT ID, last_message_content, last_message_at
FROM cte 
WHERE isLatest
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0

edited Dec 05 '21 at 18:29

answered Dec 05 '21 at 18:03

GoonerForLife

631
2
5

1

Never infer creation order from an identity column. There are a bazillion reasons this might not be the case. Identity columns are there for uniqueness only, and nothing else, ever. Just test against the created_at column. – MatBailie Dec 05 '21 at 18:17
@MatBailie - Agreed!. Updated the query to use created_at field instead of the ID. – GoonerForLife Dec 05 '21 at 18:32

score 1 · Answer 8 · answered Dec 06 '21 at 01:43

1

I second the answer with lateral view, and I may suggest a Transact-SQL variant with CROSS-APPLY

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
outer apply (
  SELECT top 1 content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
) m
 ORDER BY
    last_message_at DESC
LIMIT 15 OFFSET 0

answered Dec 06 '21 at 01:43

Igor N.

394
2
3

2

This doesn't apply to postgres DB – Rajdeep Singh Dec 07 '21 at 21:49

procopypaster · Answer 9 · 2021-12-09T02:45:00.147

Bringing together a few of the different ideas in here together, the select distinct on with a subquery for sort, and using only the messages table since it houses all the required information.

SELECT
conversation_id AS id,
content AS last_message_content,
created_at AS last_message_at
  FROM
     (SELECT DISTINCT ON (conversation_id)
         conversation_id,
         content,
         created_at
       FROM messages
       ORDER BY conversation_id, created_at DESC
       LIMIT 15 OFFSET 0) 
       as a
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/134/0 - cost of 16.9

I'm not sure how big the actual dataset it, and while a partition as used by others may perform well on the sqlfiddle available it could cause some longer run times over a larger set.

You could also set up the above as a CTE, depending on preference -

WITH SUBQUERY AS
(SELECT DISTINCT ON (conversation_id)
   conversation_id,
   content,
   created_at
  FROM messages
  ORDER BY conversation_id, created_at DESC
  LIMIT 15 OFFSET 0)

SELECT
  conversation_id AS id,
  content AS last_message_content,
  created_at AS last_message_at
FROM SUBQUERY
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/135/0 - cost of 17

If the join to conversations is required, this would work - http://sqlfiddle.com/#!17/2decb/144

score 0 · Answer 10 · answered Dec 07 '21 at 06:03

Why complicate stuff? You can simply use:

SELECT
    m.conversation_id,
    m.content AS last_message_content
    m.created_at AS last_message_at
FROM
    messages m  
where created_at in (select max(created_at) from messages m2 group by m2.conversation_id)

score 0 · Answer 11 · answered Dec 08 '21 at 08:18

Window functions are generally more performant based on my experience.

You may try this one and please let us know how much performance you've gain http://sqlfiddle.com/#!17/2decb/131/0

SELECT id, last_message_content, last_message_at
FROM (SELECT
    conversations.id,
    messages.content AS last_message_content,
    messages.created_at AS last_message_at,
    ROW_NUMBER() OVER (PARTITION BY conversations.id ORDER BY messages.created_at DESC) as num
FROM conversations INNER JOIN messages ON conversations.id = messages.conversation_id) m
WHERE m.num = 1
ORDER BY last_message_at DESC

It appears that it's miles away faster in comparison to one of the fiddle you've provided: http://sqlfiddle.com/#!17/2decb/133/0 with cost of 1680.

With the window function, it's http://sqlfiddle.com/#!17/2decb/132/0 just 41.

Getting top rows with the latest related row in joined table quickly

11 Answers11

Why? How?