
Consider a voting system implemented in PostgreSQL, where each user can vote up or down on a "foo". There is a foo table that stores all the "foo information", and a votes table that stores the user_id, foo_id, and vote, where vote is +1 or -1.
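For concreteness, here's a minimal sketch of the schema. The table and column names are as described; the types, keys, and the CHECK constraint are illustrative assumptions (the NOT NULL on vote is confirmed in the edit below):

CREATE TABLE foo (
   foo_id serial PRIMARY KEY
   -- ... the rest of the "foo information"
);

CREATE TABLE votes (
   user_id int NOT NULL,                           -- assumed to reference a users table
   foo_id  int NOT NULL REFERENCES foo,
   vote    int NOT NULL CHECK (vote IN (-1, 1)),   -- +1 or -1
   PRIMARY KEY (user_id, foo_id)                   -- assumption: one vote per user per foo
);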

To get the vote tally for each foo, the following query would work:

SELECT sum(vote) FROM votes WHERE foo.foo_id = votes.foo_id;

But the following would work just as well:

(SELECT count(vote) FROM votes 
 WHERE foo.foo_id = votes.foo_id 
 AND votes.vote = 1)
- (SELECT count(vote) FROM votes 
   WHERE foo.foo_id = votes.foo_id 
   AND votes.vote = (-1))

I currently have an index on votes.foo_id.

Which is a more efficient approach? (In other words, which would run faster?) I'm interested in both the PostgreSQL-specific answer and the general SQL answer.

EDIT

A lot of answers have been taking into account the case where vote is null. I forgot to mention that there is a NOT NULL constraint on the vote column.

Also, many have been pointing out that the first is much easier to read. Yes, that is definitely true, and if a colleague wrote the 2nd one, I would be exploding with rage unless there was a performance necessity. Nevertheless, the question is still about the performance of the two. (Technically, if the first query were way slower, it wouldn't be such a crime to write the second query.)

ryanrhee
  • `EXPLAIN ANALYZE ...`?! – deceze Feb 21 '13 at 09:05
  • I can't think of any way in which the second could be faster - you're still having to read the `vote` column for every row. – Damien_The_Unbeliever Feb 21 '13 at 09:06
  • @deceze is right. However, one important aspect I would recommend to look at: the first one with sum() is readable, and comprehensible on sight. The second one instantly throws a "WTFException" in my brain, then after being able to decipher it, whoDidThis() gets run by a non-maskable interrupt immediately. Side effects of this method might include slapping someone with a large trout... – ppeterka Feb 21 '13 at 09:09
  • Do you want to get the score for *one* particular `foo`, or for *all* `foo` at the same time? – Erwin Brandstetter Feb 21 '13 at 12:38
  • Are you implementing StackOverflow!? – James Webster Feb 21 '13 at 13:24
  • @ErwinBrandstetter A typical client will probably be looking at 100 `foo`s at a time. – ryanrhee Feb 21 '13 at 18:30
  • I didn't understand your last edit... As I said in my comment, the `null` would be a problem when no rows are returned (`sum` would return `null`, and `count` would return `0`, which are different results, so the queries are not equivalent)... And about the performance, everyone stated (correctly) that the first one would not only be easier, but also faster! – MatheusOl Feb 21 '13 at 18:43

3 Answers


Of course, the first example is faster, simpler and easier to read. That should be obvious even before one gets slapped with aquatic creatures. While sum() is slightly more expensive than count(), what matters much, much more is that the second example needs two scans.

But there is an actual difference, too: sum() can return NULL where count() doesn't. I quote the manual on aggregate functions:

It should be noted that except for count, these functions return a null value when no rows are selected. In particular, sum of no rows returns null, not zero as one might expect.
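This is easy to see with a predicate that selects no rows; the behavior is standard SQL, nothing specific to this schema:

SELECT sum(vote) AS s, count(vote) AS c
FROM   votes
WHERE  false;   -- zero rows selected: s is NULL, c is 0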

Since you seem to have a weak spot for performance optimization, here's a detail you might like: count(*) is slightly faster than count(vote). The two are only equivalent while vote is NOT NULL. Test performance with EXPLAIN ANALYZE.
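For example (a sketch; foo_id = 1 is just a placeholder value):

EXPLAIN ANALYZE SELECT count(*)    FROM votes WHERE foo_id = 1;
EXPLAIN ANALYZE SELECT count(vote) FROM votes WHERE foo_id = 1;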

On closer inspection

Both queries are syntactical nonsense, standing alone. They only make sense if you copied them from the SELECT list of a bigger query like:

SELECT *, (SELECT sum(vote) FROM votes WHERE votes.foo_id = foo.foo_id)
FROM   foo;

The important point here is the correlated subquery, which may be fine if you are only reading a small fraction of votes in your query. In that case there would be additional WHERE conditions, and you should have matching indexes.
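For illustration, a sketch of what that could look like; the WHERE condition and the multicolumn index are my assumptions, not taken from the question:

-- hypothetical covering index to support the correlated subquery
CREATE INDEX votes_foo_id_vote_idx ON votes (foo_id, vote);

SELECT f.*
     , (SELECT sum(vote) FROM votes v WHERE v.foo_id = f.foo_id) AS score
FROM   foo f
WHERE  f.foo_id < 100;   -- placeholder condition: read only few rows from foo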

In Postgres 9.3 or later, the alternative, cleaner, 100% equivalent solution would be with LEFT JOIN LATERAL ... ON true:

SELECT *
FROM   foo f
LEFT   JOIN LATERAL (
   SELECT sum(vote) FROM votes WHERE foo_id = f.foo_id
   ) v ON true;

Typically similar performance.

However, when reading large parts of table votes, or all of it, this will be (much) faster:

SELECT f.*, v.score
FROM   foo f
JOIN   (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY 1
   ) v USING (foo_id);

Aggregate values in a subquery first, then join the result to foo. USING (foo_id) is shorthand for a plain equality join condition on foo_id and folds the two join columns into a single foo_id column in the output.
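Spelled out with an explicit ON clause, the same join reads (equivalent here, since the SELECT list names its columns explicitly):

SELECT f.*, v.score
FROM   foo f
JOIN   (
   SELECT foo_id, sum(vote) AS score
   FROM   votes
   GROUP  BY 1
   ) v ON v.foo_id = f.foo_id;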

Erwin Brandstetter
  • I'm really surprised about the "*`count(*)` is slightly faster than `count(vote)`*" thing (it indeed seems so). Can you share some insight into why this is the case? I would have expected that both have to do the same work. –  Feb 21 '13 at 12:49
  • @a_horse_with_no_name: `count(*)` does not have to check on the *value* of the column itself, the mere existence of the row is enough. – Erwin Brandstetter Feb 21 '13 at 12:51
  • But with a column defined as `not null`, the value does not need to be checked either. –  Feb 21 '13 at 12:53
  • @a_horse_with_no_name: With `count(*)` Postgres doesn't even have to check for the definition of the column. Also: When joining tables (like with a `LEFT JOIN`), even a column defined `NOT NULL` can end up holding a `NULL` value. – Erwin Brandstetter Feb 21 '13 at 12:57
  • But the check for `not null` is done only once during the planning stage, so that shouldn't increase the execution time. And for a plain select (no joins) nulls cannot happen. Seems that the optimizer could be improved there ;) –  Feb 21 '13 at 12:59
  • @a_horse_with_no_name: Seems like there is some potential. I would have to dig deeper if there are other factors in play. – Erwin Brandstetter Feb 21 '13 at 13:06
  • omg, i didn't even think about doing a JOIN for this. I'm assuming the JOIN is faster b/c a subquery would take O(n log n) while the JOIN takes O(n), in terms of looking through the `vote` table? Is the JOIN still faster if the query was changed to have `ORDER BY foo_time LIMIT 100` at the end? – ryanrhee Feb 21 '13 at 18:45
  • @orryowr: The subquery is faster because the table is scanned only *once* instead of n times. The exact cost depends on more details and indexes. Concerning the added LIMIT, I'd say test with `EXPLAIN ANALYZE`. The theory only goes so far. It depends on the details: data distribution, cardinality, row size, where does `foo_time` come from ... If you *can*, it would pay to apply the condition in the subquery also. – Erwin Brandstetter Feb 22 '13 at 07:33
  • In your opinion, should I add a new column in the `Posts` table named `total_votes`, and update it via *TRIGGER* after each insert/delete on the `Votes` table? Or do I use `SUM()` for each request and calculate the total? – Shafizadeh Aug 18 '15 at 13:44
  • @Sajad: Please ask new questions as a new *question*. Comments are not the place. You can always *link* to this one for context. – Erwin Brandstetter Aug 18 '15 at 14:04
  • no no, the answer is just one word, `trigger` or `sum()`? If I ask that on SO, it will get downvotes. – Shafizadeh Aug 18 '15 at 14:08
  • @Sajad: The answer depends on the complete situation and your requirements. I would have to know data distribution, indexes, typical use cases etc. to weigh the alternatives. Triggers are more error prone but may help with performance. If in doubt, avoid the trigger solution. – Erwin Brandstetter Aug 18 '15 at 14:11

The first one will be faster. You can verify it in a simple way.

Generate some data:

CREATE TABLE votes(foo_id integer, vote integer);

-- Insert 1000000 rows into 100 foos (1 to 100)
INSERT INTO votes
SELECT round(random()*99) + 1
     , CASE round(random()) WHEN 0 THEN -1 ELSE 1 END
FROM   generate_series(1, 1000000);

CREATE INDEX idx_votes_id ON votes (foo_id);

Check both:

EXPLAIN ANALYZE SELECT sum(vote) FROM votes WHERE foo_id = 5;

EXPLAIN ANALYZE
SELECT (SELECT count(*) FROM votes WHERE foo_id = 5 AND vote = 1)
     - (SELECT count(*) FROM votes WHERE foo_id = 5 AND vote = -1);

But the truth is that they are not equivalent. To make the first one behave like the second when a foo has no votes at all, you need to handle the null case:

SELECT COALESCE(SUM(vote), 0) FROM votes WHERE foo_id = 5;

One more thing: if you are using PostgreSQL 9.2, you can create your index with both columns in it; that way you have a chance of getting an index-only scan:

CREATE INDEX idx_votes_id ON votes (foo_id, vote);

BUT! In some situations this index may be worse, so you should try both and run EXPLAIN ANALYZE to see which one performs best, or even create both and check which one PostgreSQL uses most (and drop the other).
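If you do create both, one way to see which index PostgreSQL actually uses is the statistics view pg_stat_user_indexes; the counters are cumulative since the last statistics reset:

SELECT indexrelname, idx_scan
FROM   pg_stat_user_indexes
WHERE  relname = 'votes';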

MatheusOl
  • thank you. as far as the `null` case, the column has a `not null` constraint. – ryanrhee Feb 21 '13 at 18:29
  • @orryowr, it doesn't matter: if there is no row at all (e.g. no votes given) with `foo_id = 5`, for instance, `SUM` will still return `null`. BTW, if there were `null` values, `SUM` would just ignore them (as would `count`) and not return `null`. – MatheusOl Feb 21 '13 at 18:40

I would expect the first query to run faster, as it is a single query, and it's more readable (handy in case you have to get back to this after some time).

The second approach consists of two queries whose results are then combined as if it were a single query.

That said, to be absolutely sure which of these works better for you, I would populate both tables with lots of dummy data and check the query execution time.
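For example, in psql you could switch on per-statement timing and run both variants once the tables are populated (a rough check; foo_id = 5 is a placeholder, and EXPLAIN ANALYZE gives more detail):

\timing on

SELECT sum(vote) FROM votes WHERE foo_id = 5;

SELECT (SELECT count(*) FROM votes WHERE foo_id = 5 AND vote = 1)
     - (SELECT count(*) FROM votes WHERE foo_id = 5 AND vote = -1);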

Mike
  • I'm not sure about that. It's quite possible that taking a count of rows is far more efficient than having to use the content of a row to calculate a number. But I would still choose Sum over Count for readability purposes. – Flater Feb 21 '13 at 12:52
  • @Flater: I'm not sure about that either. As I said, the first option greatly improves readability, while only a benchmark test could tell which approach is more efficient. – Mike Feb 21 '13 at 13:24
  • Unless this query will be run on a disgustingly big database, I would value the readability over the minor performance gain though :) – Flater Feb 21 '13 at 14:15
  • Me too, but the question was which one is more efficient. I'd still expect the first query not only to be more readable, but also to perform better. – Mike Feb 22 '13 at 12:59