Improving performance of join on subselect

Question

Consider this table (comments):

         id | post_id |      text
------------+---------|----------------
      79507 |     12  | Lorem Ipsum
      79544 |     12  | Foo, bar
      79545 |     14  | Interesting...

And this aggregate query:

SELECT comment_id, SUM(vote) AS votes
FROM votes 
GROUP BY comment_id;

 comment_id | votes 
------------+-------
      79507 |    3
      79544 |    4
      79545 |    1

I'm looking to joining the comments table and the aggregate query, but only interested in a very small subset of the data (only a particular post_id). This naive approach uses a subquery to correctly return the result for post_id 12:

SELECT comment_id, votes, text FROM comments c LEFT JOIN
  (SELECT comment_id, SUM(votes) AS vote
   FROM votes 
   GROUP BY comment_id) AS v
ON c.id = v.comment_id 
WHERE c.post_id = 12;

 comment_id | votes |      text
------------+-------|----------------
      79507 |    3  | Lorem Ipsum
      79544 |    4  | Foo, bar

However, this is highly inefficient, since we are computing the inner subquery on the entire table, but we are only interested in a very small subset of it (the votes table in this application is huge).

Intuitively, it seems we should be filtering the inner query and there we're missing a WHERE comment_id IN (...) in the subselect. However, we don't know which comment_ids we will need at that stage in the computation. Another subselect inside the subselect could be used to retrieve the appropriate comment_ids, but that seems very clumsy.

I'm inexperienced in SQL and not sure if there exists a cleaner solution. Perhaps the subselect approach is the wrong one altogether.

You forgot to declare your PostgreSQL version, which should be a given. — Erwin Brandstetter, May 29 '13 at 14:05
If you are working with a current version of Postgres, there is probably no need to list all columns redundantly. The primary key covers all columns of a table. [Details in this related answer.](http://stackoverflow.com/questions/8684486/rails-3-getting-a-pgerror-error-aggregates-not-allowed-in-where-clause-on-a-a/8684512#8684512) — Erwin Brandstetter, May 29 '13 at 14:10

score 3 · Accepted Answer · answered May 29 '13 at 13:11

3

Not sure I understood well, don't you need something like that ?

SELECT c.id as comment_id, SUM (v.vote) as votes, c.text
FROM comments c
LEFT JOIN votes v ON c.id = v.comment_id
WHERE c.post_id = 12
GROUP BY c.id, c.text

answered May 29 '13 at 13:11

Raphaël Althaus

59,727
6
96
122

Oh, wow. That's embarrassingly simple, can't believe I missed it. Thanks! – David Chouinard May 29 '13 at 13:22
Also, as I understand it, this implementation will need the GROUP BY clause to contain every column in both tables (except `votes`), which is a bit ugly and won't work with `SELECT *`. – David Chouinard May 29 '13 at 13:33
1

@DavidChouinard well, every column you wanna retrieve, yes. This is not "ugly" (while `select *` is), this is the only way to go ;) : all fields in the select which are not in an aggregate function must be in the group by clause. – Raphaël Althaus May 29 '13 at 13:36
Well, I said it was ugly because there's redundant information, ie. the columns we want to retrieve listed twice: in the SELECT and the GROUP BY clause. It's a basic principle that redundant data ought to be factored out. But I understand, this is as good as it gets given the limits of SQL. :) – David Chouinard May 29 '13 at 13:58
2

@DavidChouinard: I think the latest version of PG lets you get away with merely listing the primary key in the group by statement, on grounds that it's unique. (At the very least, there was a discussion on PG Hackers on this topic at some point.) – Denis de Bernardy May 29 '13 at 14:40

Improving performance of join on subselect

1 Answers1