
I'm trying to do a simple join between a table (players) and view (player_main_colors):

SELECT P.*, C.main_color FROM players P
    OUTER LEFT JOIN player_main_colors C USING (player_id)
    WHERE P.user_id=1;

This query is taking ~40 ms.

Here I'm using a nested SELECT on the VIEW instead of the JOIN:

SELECT player_id, main_color FROM player_main_colors
    WHERE player_id IN (
        SELECT player_id FROM players WHERE user_id=1);

This query is also taking ~40 ms.

When I split the query into its two pieces, it becomes as fast as I would have expected:

SELECT player_id FROM players WHERE user_id=1;

SELECT player_id, main_color FROM player_main_colors
    where player_id in (584, 9337, 11669, 12096, 13651,
        13852, 9575, 23388, 14339, 500, 24963, 25630,
        8974, 13048, 11904, 10537, 20362, 9216, 4747, 25045);

These queries take ~0.5 ms each.

So why are the above queries with the JOIN or sub-SELECT so horribly slow and how can I fix it?

Here are some details about my tables and the view:

CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    ...
)

CREATE TABLE players (
    player_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users (user_id),
    ...
)

CREATE TABLE player_data (
    player_id INTEGER NOT NULL REFERENCES players (player_id),
    game_id INTEGER NOT NULL,
    color INTEGER NOT NULL,
    PRIMARY KEY (player_id, game_id, color),
    active_time INTEGER DEFAULT 0,
    ...
)

CREATE VIEW player_main_colors AS
    SELECT DISTINCT ON (1) player_id, color as main_color
        FROM player_data
        GROUP BY player_id, color
        ORDER BY 1, MAX(active_time) DESC

It seems like it must be a problem with my VIEW...?

Here's an EXPLAIN ANALYZE for the nested SELECT query above:

Merge Semi Join  (cost=1877.59..2118.00 rows=6851 width=8) (actual time=32.946..38.471 rows=25 loops=1)
   Merge Cond: (player_data.player_id = players.player_id)
   ->  Unique  (cost=1733.19..1801.70 rows=13701 width=12) (actual time=32.651..37.209 rows=13419 loops=1)
         ->  Sort  (cost=1733.19..1767.45 rows=13701 width=12) (actual time=32.646..34.918 rows=16989 loops=1)
               Sort Key: player_data.player_id, (max(player_data.active_time))
               Sort Method: external merge  Disk: 376kB
               ->  HashAggregate  (cost=654.79..791.80 rows=13701 width=12) (actual time=13.636..19.051 rows=17311 loops=1)
                     ->  Seq Scan on player_data  (cost=0.00..513.45 rows=18845 width=12) (actual time=0.005..1.801 rows=18845 loops=1)
   ->  Sort  (cost=144.40..144.53 rows=54 width=8) (actual time=0.226..0.230 rows=54 loops=1)
         Sort Key: players.player_id
         Sort Method: quicksort  Memory: 19kB
         ->  Bitmap Heap Scan on players  (cost=4.67..142.85 rows=54 width=8) (actual time=0.035..0.112 rows=54 loops=1)
               Recheck Cond: (user_id = 1)
               ->  Bitmap Index Scan on test  (cost=0.00..4.66 rows=54 width=0) (actual time=0.023..0.023 rows=54 loops=1)
                     Index Cond: (user_id = 1)
 Total runtime: 39.279 ms

As for indexes, I only have one relevant index on top of the defaults for my primary keys:

CREATE INDEX player_user_idx ON players (user_id);

I'm currently using PostgreSQL 9.2.9.

Update:

I've reduced the problem below. See the difference between IN (4747) and IN (SELECT 4747).

Slow:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (SELECT 4747);

 Hash Join  (cost=1749.99..1975.37 rows=6914 width=8) (actual time=30.492..34.291 rows=4 loops=1)
   Hash Cond: (player_data.player_id = (4747))
   ->  Sort  (cost=1749.95..1784.51 rows=13827 width=12) (actual time=30.391..32.655 rows=17464 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: external merge  Disk: 376kB
         ->  HashAggregate  (cost=660.71..798.98 rows=13827 width=12) (actual time=12.714..17.249 rows=17464 loops=1)
               ->  Seq Scan on player_data  (cost=0.00..518.12 rows=19012 width=12) (actual time=0.006..1.898 rows=19012 loops=1)
   ->  Hash  (cost=0.03..0.03 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  HashAggregate  (cost=0.02..0.03 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=1)
               ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.001 rows=1 loops=1)
 Total runtime: 35.015 ms
(12 rows)

Time: 35.617 ms

Fast:

>> explain analyze SELECT * FROM (
          SELECT player_id, color 
            FROM player_data
            GROUP BY player_id, color
            ORDER BY MAX(active_time) DESC
       ) S
       WHERE player_id IN (4747);

 Subquery Scan on s  (cost=17.40..17.45 rows=4 width=8) (actual time=0.035..0.035 rows=4 loops=1)
   ->  Sort  (cost=17.40..17.41 rows=4 width=12) (actual time=0.034..0.034 rows=4 loops=1)
         Sort Key: (max(player_data.active_time))
         Sort Method: quicksort  Memory: 17kB
         ->  GroupAggregate  (cost=0.00..17.36 rows=4 width=12) (actual time=0.020..0.027 rows=4 loops=1)
               ->  Index Scan using player_data_pkey on player_data  (cost=0.00..17.28 rows=5 width=12) (actual time=0.014..0.019 rows=5 loops=1)
                     Index Cond: (player_id = 4747)
 Total runtime: 0.080 ms
(8 rows)

Time: 0.610 ms
  • Did you try an `exists` query? `... FROM player_main_colors p1 WHERE exists (SELECT 1 FROM players p2 where p2.player_id = p1.player_id and p2.user_id=1)` – Jan 24 '15 at 23:24
  • I had not tried that, but it appears to take 40ms as well. – user202987 Jan 24 '15 at 23:45
  • Have you analyzed all of the tables in question lately? – Mark Roberts Jan 25 '15 at 01:57
  • Yes, I have analyzed everything. – user202987 Jan 25 '15 at 02:02
  • What does the explain analyze look for in the "optimal" 0.5 ms case? I've seen cases where the optimizer is able to make better plans based on explicit parameters instead of "implied" parameters (all users where user_id = 1). – Mark Roberts Jan 25 '15 at 02:04
  • I've added some additional information. You're right that the query planner seems unable to handle this. I'm not sure if there's anything to do about it other than rewrite my queries not to use the VIEW, which makes them all uglier but allows me to move the WHERE clauses inside the SELECT that's currently the VIEW, which helps the planner. – user202987 Jan 25 '15 at 02:07
  • "Sort Method: external merge Disk: 376kB" – could you show us your configuration? Especially work_mem; it looks like it's using a very low setting, not enough for sorting in memory. – Frank Heikens Jan 25 '15 at 10:25
  • Either what @FrankHeikens says, or set random_page_cost to a lower value, to let the optimiser favorize indexed accesses. (or the tables are too small) – wildplasser Jan 25 '15 at 12:29
  • The queries added in the update are much simpler and not equivalent to the two variants presented at first. – Erwin Brandstetter Jan 25 '15 at 22:35

2 Answers


You have both GROUP BY and DISTINCT ON in your VIEW definition. That's like shooting a dead man. Simplify:

CREATE VIEW player_main_colors AS
SELECT DISTINCT ON (player_id)
       player_id, color AS main_color
FROM   player_data
ORDER  BY player_id, active_time DESC NULLS LAST;

NULLS LAST is necessary to be equivalent to your original because active_time can be NULL according to your table definition. Should be faster. But there is more. For best performance create these indexes:

CREATE INDEX players_up_idx ON players (user_id, player_id);
CREATE INDEX players_data_pa_idx ON player_data
    (player_id, active_time DESC NULLS LAST, color);

Use DESC NULLS LAST in the index as well to match the sort order of the query. Or declare the column player_data.active_time NOT NULL, and you can drop the NULLS LAST clauses everywhere.
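If you go the NOT NULL route, the change might look like this (a sketch; it assumes you first backfill any existing NULLs, since SET NOT NULL fails otherwise):

```sql
-- Backfill NULLs, then enforce the constraint:
UPDATE player_data SET active_time = 0 WHERE active_time IS NULL;
ALTER TABLE player_data ALTER COLUMN active_time SET NOT NULL;
```

After that, plain `ORDER BY active_time DESC` in the view and the matching index suffice.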

It's LEFT OUTER JOIN not OUTER LEFT JOIN - or just omit the noise word OUTER:

SELECT *  -- equivalent here to "p.*, c.main_color"
FROM   players p
LEFT   JOIN player_main_colors c USING (player_id)
WHERE  p.user_id = 1;

I'll assume there are lots of rows in player_data for each player_id, and that you are only selecting a few players. A JOIN LATERAL would be fastest for this case, but you need Postgres 9.3 or later for that. In pg 9.2 you can achieve a similar effect with correlated subqueries:

CREATE VIEW player_main_colors AS
SELECT player_id
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
ORDER  BY 1;  -- optional

Subtle difference to your original view: this includes players without any entries in player_data. You could try the same query as above based on the new view. But I would not use a view at all. This is probably fastest:

SELECT *
    , (SELECT color 
       FROM   player_data
       WHERE  player_id = p.player_id
       ORDER  BY active_time DESC NULLS LAST
       LIMIT  1) AS main_color
FROM   players p
WHERE  p.user_id = 1;

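For reference, the JOIN LATERAL variant mentioned above, for Postgres 9.3 or later, might look like this (a sketch along the same lines; not applicable on your 9.2):

```sql
-- LEFT JOIN LATERAL: run the "top color" subquery once per selected player
SELECT p.*, c.main_color
FROM   players p
LEFT   JOIN LATERAL (
   SELECT pd.color AS main_color
   FROM   player_data pd
   WHERE  pd.player_id = p.player_id
   ORDER  BY pd.active_time DESC NULLS LAST
   LIMIT  1
   ) c ON true
WHERE  p.user_id = 1;
```

Like the correlated subquery, this lets the planner use the `(player_id, active_time DESC NULLS LAST, color)` index for each of the few selected players instead of aggregating the whole table.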

Erwin Brandstetter
  • Great information, thank you very much. In my environment: VIEW without shooting the dead man: 20 ms, VIEW with correlated subqueries: 90 ms, direct query: 1 ms. I've decided to maintain a main_color column in the players table for now since it's practical and will reduce the complexity of several queries. – user202987 Jan 25 '15 at 21:00
  • @user202987: Maintaining a redundant column also has various costs. It makes writes more expensive, the table bigger and introduces additional indexes, which reduces the benefit from each. Lots of pros and cons. With the overwhelming performance of the direct query, I would use that. – Erwin Brandstetter Jan 25 '15 at 22:26

The reason for this behavior is that the query planner has limitations. When the parameter values appear as literals in the query, the planner can build a plan tailored to exactly those values. When the values arrive through joins and subselects, it has much less visibility into what they will be, so it falls back on a more "generic" plan, one that in this case is significantly slower.

The right answer for you appears to be making two selects. Perhaps a better answer might be to denormalize "main_color" onto your player table and update it at regular intervals.
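If you do denormalize, one way to keep such a column current might be a trigger on player_data (a sketch; the main_color column and the function name are hypothetical, and DELETEs would need similar handling):

```sql
-- Assumes a main_color column has been added to players
CREATE OR REPLACE FUNCTION sync_main_color() RETURNS trigger AS $$
BEGIN
   UPDATE players p
   SET    main_color = (
      SELECT pd.color
      FROM   player_data pd
      WHERE  pd.player_id = NEW.player_id
      ORDER  BY pd.active_time DESC NULLS LAST
      LIMIT  1)
   WHERE  p.player_id = NEW.player_id;
   RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER player_data_sync_color
AFTER INSERT OR UPDATE OF active_time ON player_data
FOR EACH ROW EXECUTE PROCEDURE sync_main_color();
```

This trades extra write cost for cheap reads, which is the usual denormalization bargain.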

Mark Roberts
  • Two selects would be an inferior solution. A single call to the db is almost always faster by a long shot. Denormalizing is most probably *not* necessary with improved queries and indexes. Finally, joins and subselects are not a problem for the query planner at all. Prepared statements have to prepare for any possible parameter value, which can force a more generic query plan. – Erwin Brandstetter Jan 25 '15 at 04:44