0

I'm having serious performance issues when using a LEFT OUTER JOIN and trying to use a column in the right table in Postgres. I have a table of users and a table with online_users that lists user ids that are online in my website. Both tables have indexes in the user ids. I need to run a select on the users table and list first users that are online, followed by users who aren't online. So my select is:

SELECT *
FROM users
LEFT JOIN online_users ON (users.id = online_users.usr_id)
ORDER BY online_users.online_date

I have indexes on users.id, online_users.usr_id and online_users.online_date, but for some reason, when I run ANALYZE on the query, the index for online_users.online_date isn't used by Postgres and the full scan ruins the query's performance.

Is there any way to optimize this query without changing the tables' structure (these tables are replicated, so changing the structure will require a major refactoring of our project).

Postgre version is 9.3

Below is the Explain Analyze:

                                                          QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2589440.94..2595456.84 rows=2406361 width=506) (actual time=18635.686..25775.334 rows=2239030 loops=1)
   Sort Key: usuarios_online.datamessenger
   Sort Method: external merge  Disk: 512424kB
   ->  Hash Left Join  (cost=219.73..130113.66 rows=2406361 width=506) (actual time=0.723..12388.266 rows=2239030 loops=1)
         Hash Cond: (usuarios.id = usuarios_online.id_usr)
         ->  Seq Scan on usuarios  (cost=0.00..108832.61 rows=2406361 width=494) (actual time=0.009..7328.191 rows=2238984 loops=1)
         ->  Hash  (cost=212.66..212.66 rows=566 width=12) (actual time=0.704..0.704 rows=572 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 27kB
               ->  Seq Scan on usuarios_online  (cost=0.00..212.66 rows=566 width=12) (actual time=0.079..0.555 rows=572 loops=1)
 Total runtime: 28519.611 ms
(10 rows)
Andre
  • 21
  • 3
  • 1
    Please post the full output of your `EXPLAIN ANALYZE` (click [edit] and add it to the question). – IMSoP Jul 23 '15 at 21:48
  • 1
    `(these tables are replicated, so changing the structure will require a major refactoring of our project).` Data model design problems are real problems, too. If it is wrong: change it. now. or better: yesterday. – wildplasser Jul 23 '15 at 21:53
  • Please consider instructions for performance queries in the tag-info to [\[postgresql-performance\]](http://stackoverflow.com/tags/postgresql-performance/info). – Erwin Brandstetter Jul 24 '15 at 02:23
  • @IMSoP - I added the EXPLAIN ANALYZE output. Thanks – Andre Jul 24 '15 at 22:58

1 Answers1

2

Since you only order rows from online_users, it makes sense to use a UNION query instead:

(
SELECT usr_id, online_date  -- more columns?
FROM   online_users
ORDER  BY online_date
)
UNION ALL
SELECT u.id, NULL  -- more matching columns?
FROM   users u
LEFT   JOIN online_users o ON u.id = o.usr_id
WHERE  o.usr_id IS NULL;

Should be much faster in any case.

online_users can easily utilize an index on online_date now.
The two simpler query plans can generally use indexes more easily.
All other users don't have to be sorted at all. The second SELECT only needs to exclude online_users:

Parentheses around the first SELECT are required to allow the ORDER BY where I put it.

This might be further optimized, depending on undeclared specifics of the case.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Hi, thanks for the reply. I tried to use an union statement but the result was the same in terms of performance. I added the Explain Analyse result for the query in the question. – Andre Jul 24 '15 at 22:55