I'm working with the HackerNews dataset in Postgres. There are about 17M rows about 14.5M of them are comments and about 2.5M are stories. There is a very active user named "rbanffy" who has 25k submissions, about equal split stories/comments. Both "by" and "type" have separate indices.
I have a query:
SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
and type = 'story'
ORDER BY id DESC
LIMIT 20 OFFSET 0
That runs quickly (it's using the 'by' index). If I change the type to "comment" then it's very slow. From the explain, it doesn't use either index and does a scan.
Limit (cost=0.56..56948.32 rows=20 width=1937)
-> Index Scan using hn_items_pkey on hn_items (cost=0.56..45823012.32 rows=16093 width=1937)
Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
If I change the query to have type||''='comment'
, then it will use the 'by' index and executes quickly.
Why is this happening? I understand from https://stackoverflow.com/a/309814/214545 that having to do a hack like this implies something is wrong. But I don't know what.
EDIT:
This is the explain for the type='story'
Limit (cost=72553.07..72553.12 rows=20 width=1255)
-> Sort (cost=72553.07..72561.25 rows=3271 width=1255)
Sort Key: id DESC
-> Bitmap Heap Scan on hn_items (cost=814.59..72466.03 rows=3271 width=1255)
Recheck Cond: ((by)::text = 'rbanffy'::text)
Filter: ((type)::text = 'story'::text)
-> Bitmap Index Scan on hn_items_by_index (cost=0.00..813.77 rows=19361 width=0)
Index Cond: ((by)::text = 'rbanffy'::text)
EDIT: EXPLAIN (ANALYZE,BUFFERS)
Limit (cost=0.56..59510.10 rows=20 width=1255) (actual time=20.856..545.282 rows=20 loops=1)
Buffers: shared hit=21597 read=2658 dirtied=32
-> Index Scan using hn_items_pkey on hn_items (cost=0.56..47780210.70 rows=16058 width=1255) (actual time=20.855..545.271 rows=20 loops=1)
Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
Rows Removed by Filter: 46798
Buffers: shared hit=21597 read=2658 dirtied=32
Planning time: 0.173 ms
Execution time: 545.318 ms
EDIT: EXPLAIN (ANALYZE,BUFFERS) of type='story'
Limit (cost=72553.07..72553.12 rows=20 width=1255) (actual time=44.121..44.127 rows=20 loops=1)
Buffers: shared hit=20137
-> Sort (cost=72553.07..72561.25 rows=3271 width=1255) (actual time=44.120..44.123 rows=20 loops=1)
Sort Key: id DESC
Sort Method: top-N heapsort Memory: 42kB
Buffers: shared hit=20137
-> Bitmap Heap Scan on hn_items (cost=814.59..72466.03 rows=3271 width=1255) (actual time=6.778..37.774 rows=11630 loops=1)
Recheck Cond: ((by)::text = 'rbanffy'::text)
Filter: ((type)::text = 'story'::text)
Rows Removed by Filter: 12587
Heap Blocks: exact=19985
Buffers: shared hit=20137
-> Bitmap Index Scan on hn_items_by_index (cost=0.00..813.77 rows=19361 width=0) (actual time=3.812..3.812 rows=24387 loops=1)
Index Cond: ((by)::text = 'rbanffy'::text)
Buffers: shared hit=152
Planning time: 0.156 ms
Execution time: 44.422 ms
EDIT: latest test results
I was playing around with the type='comment'
query and noticed if changed the limit to a higher number like 100, it used the by
index. I played with the values until I found the critical number was '47'. If I had a limit of 47, the by
index was used, if I had a limit of 46, it was a full scan. I assume that number isn't magical, just happens to be the threshold for my dataset or some other variable I don't know. I'm don't know if this helps.