1

The following is the Message model

class Message < ApplicationRecord
  belongs_to :parent_message, class_name: 'Message', optional: true
  has_many :child_messages, foreign_key: :parent_message_id, class_name: "Message"
  has_many :message_participants

  scope :latest_messages_by_participant, -> (user_id) do
    select("DISTINCT ON (parent_message_id) messages.*").
        joins(:message_participants).
        where(message_participants: { user_id: user_id }).
        order("parent_message_id, created_at DESC")
  end
end

message_participants has a record for each message and the various people who have sent or received that message. It has a user_id on it.

The problem with the above latest_messages_by_participant scope is that it is able to get all the child messages BUT it gets only the last parent message. That is because we are calling DISINTICT ON on parent_message_id, and for childless parent messages this value is NULL and so it's just calling distinct on NULL and returns on 1 value (the last childless parent message).

How can I fetch all the latest messages including the latest child messages AND the latest childless parent message in a single query?

I'm using Rails 6 and Postgres 11.

P.S: I should also point out a secondary problem which is that the messages are returned in created_at ASC. The created_at DESC is able to get the latest child message but does not sort the overall collection. I can solve this by calling .reverse, but wondering if there was a way to fix that as well.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Swamy g
  • 2,106
  • 4
  • 23
  • 35
  • Actual table definitions (`CREATE TABLE` statements) and some sample data would be instrumental. A plain SQL statement would also hepl. – Erwin Brandstetter May 11 '20 at 17:59
  • 1
    If you care about *performance*, we need some more information. Consider instructions here: https://stackoverflow.com/tags/postgresql-performance/info. Most importantly: roughly how many rows per user and message id on avg? – Erwin Brandstetter May 11 '20 at 19:36
  • Thank you @ErwinBrandstetter, for now, performance is not a consideration, but will definitely evaluate when things change. – Swamy g May 11 '20 at 23:10

2 Answers2

2

I believe you need to add a coalese in your distinct on & order by in order to pick the message's id when the parent_message_id is null.

select("DISTINCT ON (parent_message_id) messages.*")
...
order("parent_message_id, created_at DESC")

needs to be transformed to

select("DISTINCT ON (COALESCE(parent_message_id, messages.id)) messages.*")
...
order("COALESCE(parent_message_id, messages.id), created_at DESC")

Now, you have not provided the sample database tables & expected or a full model definition, so I'm inferring quite a few things. Here are the minimum table definitions (as I understood them), the raw sql query that would be generated by AR after my suggested modification [this is the query we want, given the schema below] & the results.

Setup

CREATE TABLE messages (
  id int primary key
, parent_message_id int references messages(id)
, created_at timestamp default current_timestamp
); 
INSERT INTO messages (id, parent_message_id) values 
  (1, NULL) -- parent message with children
, (2, 1)
, (3, 1)
, (4, NULL) -- parent message without children
, (5, NULL) -- another parent message without children
;
CREATE TABLE message_participants (
  user_id int
, message_id int references messages(id)
)
INSERT INTO message_participants values (1, 1), (2, 2), (3, 3), (1, 4), (2, 5);

RAW SQL query that gives us the last parent or child message:

SELECT DISTINCT ON (COALESCE(parent_message_id, messages.id)) messages.*
FROM messages
JOIN message_participants ON message_participants.message_id = messages.id
WHERE message_participants.user_id = ? -- replace by user_id
ORDER BY COALESCE(parent_message_id, messages.id), created_at DESC

Results

Given user_id = 1, the query above returns the result:

 id | parent_message_id |         created_at
----+-------------------+----------------------------
  1 |                   | 2020-05-11 13:50:00.857589
  4 |                   | 2020-05-11 13:50:00.857589
(2 rows)

Given user_id = 2, the query above returns the result:

 id | parent_message_id |         created_at
----+-------------------+----------------------------
  2 |                 1 | 2020-05-11 13:50:00.857589
  5 |                   | 2020-05-11 13:52:01.261975
(2 rows)

Sorting the overall results:

The created_at DESC is able to get the latest child message but does not sort the overall collection. I can solve this by calling .reverse, but wondering if there was a way to fix that as well.

To do the sorting in the database, you can wrap the above query in a cte

example:

WITH last_messages AS (
SELECT DISTINCT ON (COALESCE(parent_message_id, messages.id)) messages.*
FROM messages
JOIN message_participants ON message_participants.message_id = messages.id
WHERE message_participants.user_id = 2
ORDER BY COALESCE(parent_message_id, messages.id), created_at DESC
)
SELECT * FROM last_messages ORDER BY created_at;

However, I'm not 100% sure how that would be expressed in AR

Haleemur Ali
  • 26,718
  • 5
  • 61
  • 85
2

Use a COALESCE expression in DISTINCT ON and ORDER BY.
And sort the result in an outer query to get your desired sort order:

SELECT *
FROM  (
   SELECT DISTINCT ON (COALESCE(m.parent_message_id, m.id))
          m.*
   FROM   messages m
   JOIN   message_participants mp ON ...
   WHERE  mp.user_id = ...
   ORDER  BY (COALESCE(m.parent_message_id, m.id)), created_at DESC
   )
ORDER  BY created_at;

See (with detailed explanation):

Performance?

For few rows per user and message ID, DISTINCT ON is typically among the fastest possible solutions. For many rows, there are (much) faster ways. Depends on more information, as commented.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228