Add up conditional counts on multiple columns of the same table

Question

I am looking for a "better" way to perform a query in which I want to show a single player who he has played previously and the associated win-loss record for each such opponent.

Here are the tables involved stripped down to essentials:

create table player (player_id int, username text);
create table match (winner_id int, loser_id int);

insert into player values (1, 'john'), (2, 'mary'), (3, 'bob'), (4, 'alice');
insert into match values (1, 2), (1, 2), (1, 3), (1, 4), (1, 4), (1, 4)
                       , (2, 1), (4, 1), (4, 1);

Thus, john has a record of 2 wins and 1 loss vs mary; 1 win and 0 losses vs bob; and 3 wins and 2 losses vs alice.

create index idx_winners on match(winner_id);
create index idx_winners on match(loser_id);

I am using Postgres 9.4. Something in the back of my head tells me to consider LATERAL somehow but I'm having a hard time understanding the "shape" of such.

The following is the query I am using currently but something "feels off". Please help me learn and improve this.

select p.username as opponent, 
       coalesce(r.won, 0) as won, 
       coalesce(r.lost, 0) as lost
from (
    select m.winner_id, m.loser_id, count(m.*) as won, (
        select t.lost
        from (
            select winner_id, loser_id, count(*) as lost
            from match
            where loser_id = m.winner_id
            and winner_id = m.loser_id
            group by winner_id, loser_id
        ) t 
    )   
    from match m
    where m.winner_id = 1   -- this would be a parameter
    group by m.winner_id, m.loser_id
) r 
join player p on p.player_id = r.loser_id;

This works as expected. Just looking to learn some tricks or better yet proper techniques to do the same.

opponent  won  lost
--------  ---  ----
alice     3    2
bob       1    0
mary      2    1

I would prefer your second query because for me it is more readable. — Renzo, Jun 22 '15 at 16:31

Giorgi Nakeuri · Answer 1 · 2015-06-22T16:39:18.073

Solution with correlated subquery:

SELECT *,
       (SELECT COUNT(*) FROM match WHERE loser_id = p.player_id),
       (SELECT COUNT(*) FROM match WHERE winner_id = p.player_id)
FROM dbo.player p WHERE player_id <> 1

Solution with UNION and conditional aggregation:

SELECT  t.loser_id ,
        SUM(CASE WHEN result = 1 THEN 1 ELSE 0 END) ,
        SUM(CASE WHEN result = -1 THEN 1 ELSE 0 END)
FROM    ( SELECT    * , 1 AS result
          FROM      match
          WHERE     winner_id = 1
          UNION ALL
          SELECT    loser_id , winner_id , -1 AS result
          FROM      match
          WHERE     loser_id = 1
        ) t
GROUP BY t.loser_id

score 2 · Answer 2 · answered Jun 22 '15 at 16:31

For a single 'subject' player, I would simply union the player in both the winning and losing roles, and sum up the wins / losses:

SELECT opponent, SUM(won) as won, SUM(lost) as lost
FROM
(
    select w.username AS opponent, 0 AS won, 1 as lost, m.loser_id as me
    from "match" m
     inner join "player" w on m.winner_id = w.player_id

    UNION ALL

    select l.username AS opponent, 1 AS won, 0 as lost, m.winner_id as me
    from "match" m
     inner join "player" l on m.loser_id = l.player_id
) x
WHERE me = 1
GROUP BY opponent;

For a set based operation, we can just left join the players to the same derived union table:

SELECT p.username as player, x.opponent, SUM(x.won) as won, SUM(x.lost) as lost
FROM "player" p
LEFT JOIN
(
    select w.username AS opponent, 0 AS won, 1 as lost, m.loser_id as me
    from "match" m
     inner join "player" w on m.winner_id = w.player_id

    UNION ALL

    select l.username AS opponent, 1 AS won, 0 as lost, m.winner_id as me
    from "match" m
     inner join "player" l on m.loser_id = l.player_id
) x
on p.player_id = x.me
GROUP BY player, opponent;

SqlFiddles of both here

One small point - the names of the indices must be unique - presumably you meant:

create index idx_winners on match(winner_id);
create index idx_losers on match(loser_id);

Actually your solution is better then my second. I didn't think of selecting results to different columns to omit case statements. — Giorgi Nakeuri, Jun 22 '15 at 16:42
I think there is probably even more efficiency to be had to join to the opponent player just once outside of the derived table. — StuartLC, Jun 22 '15 at 16:43
I updated my question with another method that does what you mention. I will move it to an answer for commenting. — Brian H, Jun 22 '15 at 16:48

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

Query

The query is not as simple as it looks at first. The shortest query string does not necessarily yield best performance. This should be as fast as it gets, being as short as possible for that:

SELECT p.username, COALESCE(w.ct, 0) AS won, COALESCE(l.ct, 0) AS lost
FROM  (
   SELECT loser_id AS player_id, count(*) AS ct
   FROM   match
   WHERE  winner_id = 1  -- your player_id here
   GROUP  BY 1           -- positional reference (not your player_id)
   ) w
FULL JOIN (
   SELECT winner_id AS player_id, count(*) AS ct
   FROM   match
   WHERE  loser_id = 1   -- your player_id here
   GROUP  BY 1
   ) l USING (player_id)
JOIN   player p USING (player_id)
ORDER  BY 1;

Result exactly as requested:

username | won | lost
---------+-----+-----
alice    | 3   | 2
bob      | 1   | 0
mary     | 2   | 1

SQL Fiddle - with more revealing test data!

The key feature is the FULL [OUTER] JOIN between the two subqueries for losses and wins. This produces a table of all players our candidate has played against. The USING clause in the join condition conveniently merges the two player_id columns into one.

After that, a single JOIN to player to get the name, and COALESCE to replace NULL with 0. Voilá.

Index

Would be even faster with two multicolumn indexes:

CREATE INDEX idx_winner on match (winner_id, loser_id);
CREATE INDEX idx_loser  on match (loser_id, winner_id);

Only if you get index-only scans out of this. Then Postgres does not even visit the match table at all and you get super-fast results.

With two integer columns you happen to hit a local optimum: theses indexes have just the same size as the simple ones you had. Details:

Is a composite index also good for queries on the first field?

Shorter, but slow

You could run correlated subqueries like @Giorgi suggested, just working correctly:

SELECT *
FROM  (
   SELECT username
       , (SELECT count(*) FROM match
          WHERE  loser_id  = p.player_id
          AND    winner_id = 1) AS won
       , (SELECT count(*) FROM match
          WHERE  winner_id = p.player_id
          AND    loser_id  = 1) AS lost
   FROM   player p
   WHERE  player_id <> 1
   ) sub
WHERE (won > 0 OR lost > 0)
ORDER  BY username;

Works fine for small tables, but doesn't scale. This needs a sequential scan on player and two index scans on match per existing player. Compare performance with EXPLAIN ANALYZE.

Wow, that's great info. Thank you. This is not unlike the self-answer I posted earlier except using a FULL JOIN which seems cleaner to me. So I grok the index-only scans, the reason for adding say loser_id to idx_winner is that the idx_winner index is used for producing w but you throw in loser_id so that the data is in the index and you avoid loading table data... That's awesome. — Brian H, Jun 23 '15 at 01:59
@JohnnyServers: Yes, you got the point about index-only scans. Other RDBMS (like Oracle) talk about "covering indexes". And yes, the version you posted is close, but loses traction in the outer query, where you introduce a seq scan on `players`. Plus, only use CTEs where necessary, they introduce optimization barriers and add overhead cost. — Erwin Brandstetter, Jun 23 '15 at 02:08

Brian H · Answer 4 · 2015-06-22T17:30:40.900

Something more readable than my original. Thoughts?

with W as (
    select loser_id as opponent_id,
    count(*) as n
    from match
    where winner_id = 1
    group by loser_id
),
L as (
    select winner_id as opponent_id,
    count(*) as n
    from match
    where loser_id = 1
    group by winner_id
)
select player.username, coalesce(W.n, 0) as wins, coalesce(L.n, 0) as losses
from player
left join W on W.opponent_id = player.player_id
left join L on L.opponent_id = player.player_id
where player.player_id != 1;

                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Hash Left Join  (cost=73.78..108.58 rows=1224 width=48)
   Hash Cond: (player.player_id = l.opponent_id)
   CTE w
     ->  HashAggregate  (cost=36.81..36.83 rows=2 width=4)
           Group Key: match.loser_id
           ->  Seq Scan on match  (cost=0.00..36.75 rows=11 width=4)
                 Filter: (winner_id = 1)
   CTE l
     ->  HashAggregate  (cost=36.81..36.83 rows=2 width=4)
           Group Key: match_1.winner_id
           ->  Seq Scan on match match_1  (cost=0.00..36.75 rows=11 width=4)
                 Filter: (loser_id = 1)
   ->  Hash Left Join  (cost=0.07..30.15 rows=1224 width=44)
         Hash Cond: (player.player_id = w.opponent_id)
         ->  Seq Scan on player  (cost=0.00..25.38 rows=1224 width=36)
               Filter: (player_id <> 1)
         ->  Hash  (cost=0.04..0.04 rows=2 width=12)
               ->  CTE Scan on w  (cost=0.00..0.04 rows=2 width=12)
   ->  Hash  (cost=0.04..0.04 rows=2 width=12)
         ->  CTE Scan on l  (cost=0.00..0.04 rows=2 width=12)

The above has a performance killer with the player_id != 1. I think I can avoid that by only scanning the results of the joins, no?

explain with W as (
        select loser_id as opponent_id,
        count(*) as n
        from match
        where winner_id = 1 
        group by loser_id
    ),  
    L as (
        select winner_id as opponent_id,
        count(*) as n
        from match
        where loser_id = 1 
        group by winner_id
    )   
    select t.* from (
        select player.player_id, player.username, coalesce(W.n, 0) as wins, coalesce(L.n, 0) as losses
        from player
        left join W on W.opponent_id = player.player_id
        left join L on L.opponent_id = player.player_id
    ) t 
    where t.player_id != 1;

                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Hash Left Join  (cost=73.78..74.89 rows=3 width=52)
   Hash Cond: (player.player_id = l.opponent_id)
   CTE w
     ->  HashAggregate  (cost=36.81..36.83 rows=2 width=4)
           Group Key: match.loser_id
           ->  Seq Scan on match  (cost=0.00..36.75 rows=11 width=4)
                 Filter: (winner_id = 1)
   CTE l
     ->  HashAggregate  (cost=36.81..36.83 rows=2 width=4)
           Group Key: match_1.winner_id
           ->  Seq Scan on match match_1  (cost=0.00..36.75 rows=11 width=4)
                 Filter: (loser_id = 1)
   ->  Hash Left Join  (cost=0.07..1.15 rows=3 width=44)
         Hash Cond: (player.player_id = w.opponent_id)
         ->  Seq Scan on player  (cost=0.00..1.05 rows=3 width=36)
               Filter: (player_id <> 1)
         ->  Hash  (cost=0.04..0.04 rows=2 width=12)
               ->  CTE Scan on w  (cost=0.00..0.04 rows=2 width=12)
   ->  Hash  (cost=0.04..0.04 rows=2 width=12)
         ->  CTE Scan on l  (cost=0.00..0.04 rows=2 width=12)

The cte definitely adds to readability, but the `!= 1` could be a performance killer. — StuartLC, Jun 22 '15 at 16:58
I added the execution plan from pg 9.4. Trying to understand how to avoid this. I thought PG would eval W then L then do PK lookups on player instead of a full table scan. I see I didn't define player_id as a PK for this example but it would be in real usage. Let me try that... No difference. Couldn't we just avoid that issue but only filtering the results afterwards as in my edit? — Brian H, Jun 22 '15 at 17:23

Add up conditional counts on multiple columns of the same table

4 Answers4

Query

Index

Shorter, but slow

Linked