To return a particular row from each group, you need to add a deterministic ORDER BY
clause. Basically:
SELECT DISTINCT ON (q.id)
q.id AS question_id
, q.title AS question_title
, q.created_at AS question_created_at
, q.updated_at AS question_updated_at
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, SUM(v.value) AS votes
FROM questions q
LEFT JOIN answers a ON q.id = a.question_id
LEFT JOIN votes v ON a.id = v.answer_id
GROUP BY q.id, a.id -- for the sum
ORDER BY q.id, a.created_at DESC NULLS LAST, a.id;
The first ORDER BY
item must agree with the DISTINCT ON
clause.
You want the "latest" answer, so a.created_at DESC
is next.
NULLS LAST
because the column might be nullable (you did not disclose).
The final a.id
only serves a tiebreaker in case multiple answers tie on a.created_at
.
Detailed explanation:
After joining to votes
already, the correlated subquery for the sum of votes is not needed:
(SELECT SUM(votes.value) AS votes FROM votes WHERE answers.id =votes.answer_id)
Currently, you probably get incorrect (multiplied) sums. Assuming a one-to-many relationship between answers
and votes
(else, the vote count could just be added as another column to answers
), it's either-or: Either join to the table, then GROUP BY
, or do not join to the table and add that correlated subquery.
I fixed it with a plain sum()
keeping the join, assuming q.id
and a.id
are the respective primary keys of their tables (you did not disclose the table definition). That's possible because DISTINCT ON
is applied after GROUP BY
. See:
Or see below for a possibly better solution.
While you return all or most questions the query is typically faster if you join after getting the latest answer per question. Like:
SELECT q.id AS question_id
, q.title AS question_title
, q.created_at AS question_created_at
, q.updated_at AS question_updated_at
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, u.user_name -- whatever you need from users table
, (SELECT SUM(value) FROM votes v WHERE v.answer_id = a.answer_id) AS votes
FROM questions q
LEFT JOIN (
SELECT DISTINCT ON (a.question_id)
a.question_id AS id
, a.id AS answer_id
, a.content AS answer_content
, a.created_at AS answer_created_at
, a.updated_at AS answer_updated_at
, a.user_id
FROM answers a
ORDER BY a.question_id, a.created_at DESC NULLS LAST, a.id
) a USING (id)
LEFT JOIN users u ON u.id = a.user_id
Here, I kept the correlated subquery for votes
because it's typically cheaper to do that after reducing to chosen answers instead of counting for all answers.
Similar for users
(added in your answer): join after reducing to the chosen answer. And put something from users
in the SELECT
list to actually return from it.
If your table answers
is big, a multicolumn index on answer(question_id, created_at DESC NULLS LAST)
would be ideal for performance.
If there are many answers per question, a different query technique may be faster. See:
For retrieving a small percentage of all questions, LATERAL
or correlated subqueries are typically faster.
Details depend on undisclosed table definitions and cardinalities.