Can you please help me understand the reason for the performance drop between these statements?

To me it seems like in the case of D & E the planner first joins the address to all subscribers and only at the end applies OFFSET & LIMIT. Why on earth would it do that?

Am I missing something about how subselects and OFFSET work together? Shouldn't it first find the right offset and only then start executing the subselects?

user_id and address_id are primary keys
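
For reference, a minimal schema sketch that matches the queries below (an assumption on my part -- the real DDL is larger and not shown here, and the index name is made up; the execution plan further down suggests a comparable index on the address table's user_id column exists):

-- Assumed schema sketch; not the actual table definitions.
CREATE TABLE subscribers (
    user_id integer PRIMARY KEY
);

CREATE TABLE address (
    address_id integer PRIMARY KEY,
    user_id    integer NOT NULL REFERENCES subscribers (user_id)
);

-- Hypothetical index name; the plan for E uses a similar index (ix_pcv_user_address_user_id).
CREATE INDEX ix_address_user_id ON address (user_id);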

Select A: 15 ms (OK): Select first 200 subscribers

SELECT s.user_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 0 LIMIT 200

Select B: 45 ms (OK): Select last 200 subscribers

SELECT s.user_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200

Select C: 15 ms (OK): Select first 200 subscribers together with first available address

SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 0 LIMIT 200

Select D: 500 ms (Not OK): Select last 200 subscribers together with first available address

SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200

Select E: 1000 ms (Even worse): Select last 200 subscribers together with first 2 available addresses

SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200

Select F: 15 ms (Nice): Select last 200 subscribers together with first 2 available addresses without offset but WHERE s.user_id > 100385 instead

SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
WHERE s.user_id > 100385 --same as OFFSET 100000 in my data
ORDER BY s.user_id
LIMIT 200

Execution plan for E:

'Limit  (cost=1677635.30..1677635.80 rows=200 width=4) (actual time=2251.503..2251.816 rows=200 loops=1)'
'  Output: s.user_id, ((SubPlan 1)), ((SubPlan 2))'
'  Buffers: shared hit=607074'
'  ->  Sort  (cost=1677385.30..1677636.08 rows=100312 width=4) (actual time=2146.867..2200.704 rows=100200 loops=1)'
'        Output: s.user_id, ((SubPlan 1)), ((SubPlan 2))'
'        Sort Key: s.user_id'
'        Sort Method:  quicksort  Memory: 7775kB'
'        Buffers: shared hit=607074'
'        ->  Seq Scan on public.pcv_subscriber s  (cost=0.00..1669052.31 rows=100312 width=4) (actual time=0.040..2046.926 rows=100312 loops=1)'
'              Output: s.user_id, (SubPlan 1), (SubPlan 2)'
'              Buffers: shared hit=607074'
'              SubPlan 1'
'                ->  Limit  (cost=8.29..8.29 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=100312)'
'                      Output: ua.user_address_id'
'                      Buffers: shared hit=301458'
'                      ->  Sort  (cost=8.29..8.29 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=100312)'
'                            Output: ua.user_address_id'
'                            Sort Key: ua.user_address_id'
'                            Sort Method:  quicksort  Memory: 25kB'
'                            Buffers: shared hit=301458'
'                            ->  Index Scan using ix_pcv_user_address_user_id on public.pcv_user_address ua  (cost=0.00..8.28 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=100312)'
'                                  Output: ua.user_address_id'
'                                  Index Cond: (ua.user_id = $0)'
'                                  Buffers: shared hit=301458'
'              SubPlan 2'
'                ->  Limit  (cost=8.29..8.29 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=100312)'
'                      Output: ua.user_address_id'
'                      Buffers: shared hit=301458'
'                      ->  Sort  (cost=8.29..8.29 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=100312)'
'                            Output: ua.user_address_id'
'                            Sort Key: ua.user_address_id'
'                            Sort Method:  quicksort  Memory: 25kB'
'                            Buffers: shared hit=301458'
'                            ->  Index Scan using ix_pcv_user_address_user_id on public.pcv_user_address ua  (cost=0.00..8.28 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=100312)'
'                                  Output: ua.user_address_id'
'                                  Index Cond: (ua.user_id = $0)'
'                                  Buffers: shared hit=301458'
'Total runtime: 2251.968 ms'

Disclaimer: This is a stripped-down example of a much larger and more complex statement that enables a GUI table to sort/page/filter subscribers with a lot of additional accumulated data across several tables. I know this example can be done in better ways, so please instead help me understand why this solution is so slow, or at best suggest minimal changes.

Update 1:

This was produced using Postgres 9.0.3

Update 2:

Currently the best solution to my problem that I can come up with seems to be this stupid statement:

Select G: 73ms (OKish)

SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
WHERE s.user_id >= (SELECT user_id from subscribers ORDER BY user_id OFFSET 100000 LIMIT 1)
ORDER BY s.user_id
LIMIT 200

Update 3:

Best select so far, from David (same performance as G, but more intuitive):

Select H: 73ms (OKish)

SELECT s2.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s2.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM (SELECT s.user_id
      FROM  subscribers s
      ORDER BY s.user_id
      OFFSET 100000 LIMIT 200) s2

Execution plan for H:

This is how I imagined it to be for E as well in the first place.

Torge
  • Conclusion: OFFSET considered harmful??? – wildplasser Mar 27 '15 at 23:39
  • I would rather say that the planner seems to have a performance bug when using offsets together with sub-queries, not that OFFSET is harmful in general. You cannot avoid using offset in most situations anyway. – Torge Mar 28 '15 at 14:47
  • http://use-the-index-luke.com/no-offset – a_horse_with_no_name Mar 28 '15 at 15:42
  • `user_id and address_id are primary keys` That is insufficient information. Please add the real data definition. (`user_id` appears to be a FK in the addresses table, for instance) – wildplasser Mar 28 '15 at 16:28
  • @a_horse_with_no_name Nice read. But what if the user jumps from page 1 to 33 of 100? @wildplasser Yes, it's an FK, but I don't see why this matters to my question of why the optimizer seemingly executes unnecessary sub-queries. – Torge Mar 28 '15 at 16:31
  • I was trying to do something similar with offset myself. I detailed my response here: https://stackoverflow.com/a/56832916/5267723 – John Youngtree Jul 01 '19 at 09:13

2 Answers

I think that the join expressed in the SELECT clause is being executed even for the 100000 rows you are not including in the final data set.

How about this:

SELECT s2.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s2.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM (select *
      from   subscribers s
      ORDER BY s.user_id
      OFFSET 100000 LIMIT 200) s2

Failing that, try a common table expression:

With s2 as (
  select *
  from   subscribers s
  ORDER BY s.user_id
  OFFSET 100000 LIMIT 200)
SELECT s2.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s2.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM s2
David Aldridge
  • Thanks for your proposals. The second way is new to me. Here are the times: 75 ms & 100 ms. So your first solution has the same performance as my G proposal but makes more sense. Anyhow, I still want to understand if this behaviour is a bug or has an unavoidable reason. – Torge Mar 28 '15 at 14:51
  • I think that we see intuitively that query D could perform similarly to query B because the query in the SELECT clause could be ignored for those first 100,000 rows, but the optimiser does not recognise that. I wouldn't call it a bug, more of an unimplemented potential optimisation. – David Aldridge Mar 28 '15 at 16:00
  • That's why I would call it a "performance bug" in the optimizer, because I kind of expect a modern optimizer to see that nothing from the sub-queries is referenced. There is not even a WHERE, GROUP BY or HAVING clause. But I guess you are right with "missed opportunity". – Torge Mar 28 '15 at 16:16

This seems to perform reasonably for the ranks = {1,2} case. (CTEs were terrible, FYI.)

-- EXPLAIN ANALYZE
SELECT s.user_id
        , MAX (CASE WHEN a0.rn = 1 THEN a0.address_id ELSE NULL END) AS ad1
        , MAX (CASE WHEN a0.rn = 2 THEN a0.address_id ELSE NULL END) AS ad2
FROM subscribers s
JOIN (  SELECT user_id, address_id
        , row_number() OVER(PARTITION BY user_id ORDER BY address_id) AS rn
        FROM address
        )a0 ON a0.user_id = s.user_id AND a0.rn <= 2
GROUP BY s.user_id
ORDER BY s.user_id
OFFSET 10000 LIMIT 200
        ;

UPDATE: the query below seems to perform slightly better:

    -- ----------------------------------
-- EXPLAIN ANALYZE
SELECT s.user_id
        , MAX (CASE WHEN a0.rn = 1 THEN a0.address_id ELSE NULL END) AS ad1
        , MAX (CASE WHEN a0.rn = 2 THEN a0.address_id ELSE NULL END) AS ad2
FROM ( SELECT user_id
        FROM subscribers
        ORDER BY user_id
        OFFSET 10000
        LIMIT 200
        ) s 
JOIN (     SELECT user_id, address_id
        , row_number() OVER(PARTITION BY user_id ORDER BY address_id) AS rn
        FROM address
        ) a0 ON a0.user_id = s.user_id AND a0.rn <= 2
GROUP BY s.user_id
ORDER BY s.user_id
        ;

Note: in both queries the JOINs should probably be LEFT JOINs, to allow for the 1st and 2nd address to be missing (as sketched below).
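
For illustration, the second query with the join switched to a LEFT JOIN might look like this (a sketch only; the rn <= 2 condition stays in the ON clause so subscribers without any address are kept, with ad1/ad2 returned as NULL):

-- Sketch: LEFT JOIN variant, so subscribers without addresses still appear.
SELECT s.user_id
        , MAX (CASE WHEN a0.rn = 1 THEN a0.address_id ELSE NULL END) AS ad1
        , MAX (CASE WHEN a0.rn = 2 THEN a0.address_id ELSE NULL END) AS ad2
FROM ( SELECT user_id
        FROM subscribers
        ORDER BY user_id
        OFFSET 10000
        LIMIT 200
        ) s
LEFT JOIN ( SELECT user_id, address_id
        , row_number() OVER(PARTITION BY user_id ORDER BY address_id) AS rn
        FROM address
        ) a0 ON a0.user_id = s.user_id AND a0.rn <= 2
GROUP BY s.user_id
ORDER BY s.user_id
        ;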


UPDATE: combining the subsetting subquery (like in @David Aldridge's answer) with the original (two scalar subqueries):

Joining the subscribers table with itself allows indexes to be used for the scalar subqueries, without the need to throw away the first 100K result rows.

-- EXPLAIN ANALYZE
SELECT s.user_id
, (SELECT address_id
        FROM address a
        WHERE a.user_id = s.user_id
        ORDER BY address_id OFFSET 0 LIMIT 1
        ) AS a_id1
, (SELECT address_id
        FROM address a
        WHERE a.user_id = s.user_id
        ORDER BY address_id OFFSET 1 LIMIT 1
        ) AS a_id2
FROM subscribers s
JOIN (
        SELECT user_id
        FROM subscribers
        ORDER BY user_id
        OFFSET 10000 LIMIT 200
        ) x ON x.user_id = s.user_id
ORDER BY s.user_id
        ;
wildplasser