Can you please help me understand the reason for the performance drop between these statements?
For me it seems like in case of D & E he is first joining the address to all subscribers and at the end applies Offset & Limit. Why on earth would he do that?
Am I missing something about how Subselects and Offset work together? Shouldn't he first find the right offset and then start executing the subselects?
user_id and address_id are primary keys
Select A: 15 ms (OK): select first 200 subscribers
SELECT s.user_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 0 LIMIT 200
Select B: 45 ms (OK): Select last 200 subscribers
SELECT s.user_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200
Select C: 15 ms (OK): Select first 200 subscribers together with first available address
SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 0 LIMIT 200
Select D: 500 ms (Not OK): Select last 200 subscribers together with first available address
SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200
Select E: 1000 ms (Even worse): Select last 200 subscribers together with first 2 available addresses
SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200
Select F: 15 ms (Nice): Select last 200 subscribers together with first 2 available addresses without offset but WHERE s.user_id > 100385 instead
SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
WHERE s.user_id > 100385 --same as OFFSET 100000 in my data
ORDER BY s.user_id
LIMIT 200
Execution plan for E:
'Limit (cost=1677635.30..1677635.80 rows=200 width=4) (actual time=2251.503..2251.816 rows=200 loops=1)'
' Output: s.user_id, ((SubPlan 1)), ((SubPlan 2))'
' Buffers: shared hit=607074'
' -> Sort (cost=1677385.30..1677636.08 rows=100312 width=4) (actual time=2146.867..2200.704 rows=100200 loops=1)'
' Output: s.user_id, ((SubPlan 1)), ((SubPlan 2))'
' Sort Key: s.user_id'
' Sort Method: quicksort Memory: 7775kB'
' Buffers: shared hit=607074'
' -> Seq Scan on public.pcv_subscriber s (cost=0.00..1669052.31 rows=100312 width=4) (actual time=0.040..2046.926 rows=100312 loops=1)'
' Output: s.user_id, (SubPlan 1), (SubPlan 2)'
' Buffers: shared hit=607074'
' SubPlan 1'
' -> Limit (cost=8.29..8.29 rows=1 width=4) (actual time=0.008..0.008 rows=1 loops=100312)'
' Output: ua.user_address_id'
' Buffers: shared hit=301458'
' -> Sort (cost=8.29..8.29 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=100312)'
' Output: ua.user_address_id'
' Sort Key: ua.user_address_id'
' Sort Method: quicksort Memory: 25kB'
' Buffers: shared hit=301458'
' -> Index Scan using ix_pcv_user_address_user_id on public.pcv_user_address ua (cost=0.00..8.28 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=100312)'
' Output: ua.user_address_id'
' Index Cond: (ua.user_id = $0)'
' Buffers: shared hit=301458'
' SubPlan 2'
' -> Limit (cost=8.29..8.29 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=100312)'
' Output: ua.user_address_id'
' Buffers: shared hit=301458'
' -> Sort (cost=8.29..8.29 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=100312)'
' Output: ua.user_address_id'
' Sort Key: ua.user_address_id'
' Sort Method: quicksort Memory: 25kB'
' Buffers: shared hit=301458'
' -> Index Scan using ix_pcv_user_address_user_id on public.pcv_user_address ua (cost=0.00..8.28 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=100312)'
' Output: ua.user_address_id'
' Index Cond: (ua.user_id = $0)'
' Buffers: shared hit=301458'
'Total runtime: 2251.968 ms'
Disclaimer: This is a stripped down example of a much larger and more complex statement that enables a GUI Table to sort/page/filter a subscriber with a lot of additional accumulated data across several tables. So I know this example can be done in better ways. So instead please help me understand why this solution is so slow or at best suggest minimal changes.
Update 1:
This was produced using Postgres 9.0.3
Update 2:
Currently the best solution to my problem I can come up with seems to be this stupid statement:
Select G: 73ms (OKish)
SELECT s.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id_1,
(SELECT address_id FROM address a WHERE a.user_id = s.user_id ORDER BY address_id OFFSET 1 LIMIT 2) AS a_id_2
FROM subscribers s
WHERE s.user_id >= (SELECT user_id from subscribers ORDER BY user_id OFFSET 100000 LIMIT 1)
ORDER BY s.user_id
LIMIT 200
Update 3:
Best select so far from David. (same performance as G but more intuitive)
Select H: 73ms (OKish)
SELECT s2.user_id,
(SELECT address_id FROM address a WHERE a.user_id = s2.user_id ORDER BY address_id OFFSET 0 LIMIT 1) AS a_id
FROM (SELECT s.user_id
FROM subscribers s
ORDER BY s.user_id
OFFSET 100000 LIMIT 200) s2
Execution plan for H:
This is how I imagined it to be for E as well in the first place.