
I'm working with the HackerNews dataset in Postgres. There are about 17M rows; about 14.5M of them are comments and about 2.5M are stories. There is a very active user named "rbanffy" who has 25k submissions, roughly an even split of stories and comments. Both "by" and "type" have separate indices.
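
For reference, the relevant parts of the schema look roughly like this (a sketch, not the exact DDL; the name of the type index is a guess):

CREATE TABLE hn_items (
    id   bigint PRIMARY KEY,   -- the HN item id that I ORDER BY
    by   varchar,              -- submitter username
    type varchar               -- 'story', 'comment', ...
    -- plus the rest of the HN fields (title, text, score, descendants, ...)
);

CREATE INDEX hn_items_by_index ON hn_items (by);
CREATE INDEX hn_items_type_index ON hn_items (type);   -- assumed name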

I have a query:

SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
and type = 'story'
ORDER BY id DESC
LIMIT 20 OFFSET 0

That runs quickly (it's using the 'by' index). If I change the type to "comment", it's very slow. From the EXPLAIN, it doesn't use either index and does a scan instead:

Limit  (cost=0.56..56948.32 rows=20 width=1937)
  ->  Index Scan using hn_items_pkey on hn_items  (cost=0.56..45823012.32 rows=16093 width=1937)
        Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))

If I change the query to use type||'' = 'comment', it uses the 'by' index and executes quickly.
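
In other words, this version runs fast:

SELECT *
FROM "hn_items"
WHERE by = 'rbanffy'
and type||'' = 'comment'
ORDER BY id DESC
LIMIT 20 OFFSET 0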

Why is this happening? I understand from https://stackoverflow.com/a/309814/214545 that having to do a hack like this implies something is wrong. But I don't know what.

EDIT:
This is the explain for the type='story'

Limit  (cost=72553.07..72553.12 rows=20 width=1255)
  ->  Sort  (cost=72553.07..72561.25 rows=3271 width=1255)
        Sort Key: id DESC
        ->  Bitmap Heap Scan on hn_items  (cost=814.59..72466.03 rows=3271 width=1255)
              Recheck Cond: ((by)::text = 'rbanffy'::text)
              Filter: ((type)::text = 'story'::text)
              ->  Bitmap Index Scan on hn_items_by_index  (cost=0.00..813.77 rows=19361 width=0)
                    Index Cond: ((by)::text = 'rbanffy'::text)

EDIT: EXPLAIN (ANALYZE,BUFFERS)

Limit  (cost=0.56..59510.10 rows=20 width=1255) (actual time=20.856..545.282 rows=20 loops=1)
  Buffers: shared hit=21597 read=2658 dirtied=32
  ->  Index Scan using hn_items_pkey on hn_items  (cost=0.56..47780210.70 rows=16058 width=1255) (actual time=20.855..545.271 rows=20 loops=1)
        Filter: (((by)::text = 'rbanffy'::text) AND ((type)::text = 'comment'::text))
        Rows Removed by Filter: 46798
        Buffers: shared hit=21597 read=2658 dirtied=32
Planning time: 0.173 ms
Execution time: 545.318 ms

EDIT: EXPLAIN (ANALYZE,BUFFERS) of type='story'

Limit  (cost=72553.07..72553.12 rows=20 width=1255) (actual time=44.121..44.127 rows=20 loops=1)
  Buffers: shared hit=20137
  ->  Sort  (cost=72553.07..72561.25 rows=3271 width=1255) (actual time=44.120..44.123 rows=20 loops=1)
        Sort Key: id DESC
        Sort Method: top-N heapsort  Memory: 42kB
        Buffers: shared hit=20137
        ->  Bitmap Heap Scan on hn_items  (cost=814.59..72466.03 rows=3271 width=1255) (actual time=6.778..37.774 rows=11630 loops=1)
              Recheck Cond: ((by)::text = 'rbanffy'::text)
              Filter: ((type)::text = 'story'::text)
              Rows Removed by Filter: 12587
              Heap Blocks: exact=19985
              Buffers: shared hit=20137
              ->  Bitmap Index Scan on hn_items_by_index  (cost=0.00..813.77 rows=19361 width=0) (actual time=3.812..3.812 rows=24387 loops=1)
                    Index Cond: ((by)::text = 'rbanffy'::text)
                    Buffers: shared hit=152
Planning time: 0.156 ms
Execution time: 44.422 ms

EDIT: latest test results. I was playing around with the type='comment' query and noticed that if I changed the limit to a higher number like 100, it used the 'by' index. I played with the values until I found the critical number was 47: with a limit of 47 the 'by' index was used, with a limit of 46 it was a full scan. I assume that number isn't magical, it just happens to be the threshold for my dataset or some other variable I don't know. I don't know if this helps.
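
For example, these two differ only in the limit (output omitted):

EXPLAIN SELECT * FROM "hn_items"
WHERE by = 'rbanffy' and type = 'comment'
ORDER BY id DESC LIMIT 47;  -- uses hn_items_by_index

EXPLAIN SELECT * FROM "hn_items"
WHERE by = 'rbanffy' and type = 'comment'
ORDER BY id DESC LIMIT 46;  -- back to the hn_items_pkey scan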

Yehosef
  • @a_horse_with_no_name - I don't know how to show more of the execution plan - I ran "EXPLAIN " and this was the result. What else should I do? Running analyze didn't change anything. – Yehosef May 29 '18 at 12:19
  • It might be a problem with the physical ordering of the data. Can we have `EXPLAIN (ANALYZE, BUFFERS)` output? – Laurenz Albe May 29 '18 at 12:22
  • @LaurenzAlbe - added. Note that the PK id is indexed ASC and I'm ordering DESC. But I've tried changing my ordering to ASC and it doesn't change the explain. (I removed the "Backward" after the "Index Scan" to avoid confusion - since it didn't seem to make a difference.) – Yehosef May 29 '18 at 13:51
  • For the record - if I remove the "order by" it uses the right index without a hack. If I change the order by to "by" or "type" instead of "id" it uses the index (but that's not what I want..) When I choose "order by id" (asc||desc) - it doesn't use it. This is all when using the type="comment" - type="story" always works. – Yehosef May 29 '18 at 13:56
  • It seems the optimizer under-estimates the number of rows returned by the query and thinks that the overhead of sorting the rows is bigger than the overhead of using the "wrong" index. Please edit the question and add the definition of the index `hn_items_by_index`. Can you show the output of `explain (analyze, buffers)` for the _fast_ query? – a_horse_with_no_name May 29 '18 at 13:59
  • @a_horse_with_no_name - added. – Yehosef May 29 '18 at 14:46
  • Maybe you need to run/rerun vacuum analyze (so table stats are updated and the optimizer can make better decisions). Another thing to try: set enable_seqscan=false; – weinerk May 29 '18 at 16:58
  • @weinerk - thanks. Tried both (the exact statements are below), no improvement. – Yehosef May 29 '18 at 20:48
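
For the record, the statements from that last suggestion were:

VACUUM ANALYZE hn_items;
SET enable_seqscan = false;  -- per weinerk's suggestion; neither changed the plan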

1 Answer


Since there are many comments by rbanffy, PostgreSQL assumes that it will be fast enough if it searches the table in the order implied by the ORDER BY clause (which can use the primary key index) until it has found 20 rows that match the search condition.

Unfortunately, it happens that the guy has grown lazy lately; at any rate, PostgreSQL has to scan the 46798 highest ids until it has found its 20 hits. (You really shouldn't have removed the Backwards, that confused me.)

The best way to work around that is to confuse PostgreSQL so that it doesn't choose the primary key index, perhaps like this:

SELECT *
FROM (SELECT * FROM hn_items
      WHERE by = 'rbanffy'
        AND type = 'comment'
      OFFSET 0) q
ORDER BY id DESC
LIMIT 20;
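
The OFFSET 0 acts as an optimization fence: PostgreSQL will not flatten a subquery that contains a LIMIT or OFFSET into the outer query, so the planner can no longer push the LIMIT down onto the primary key index. To check that the rewrite behaves as intended, wrap it in EXPLAIN and confirm that the inner scan uses hn_items_by_index rather than hn_items_pkey (output not shown):

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM (SELECT * FROM hn_items
      WHERE by = 'rbanffy'
        AND type = 'comment'
      OFFSET 0) q
ORDER BY id DESC
LIMIT 20;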
Laurenz Albe
  • I wonder if creating extended stats on both columns would help the optimizer in any way (sketched at the end of this thread) –  May 29 '18 at 14:19
  • I doubt it, because the problem here is not the wrong estimate, but the unfortunate distribution of the values. – Laurenz Albe May 29 '18 at 14:28
  • Thanks for the answer - digesting. About the `Backwards`, I could have given the same query using `order by id ASC`, it doesn't print the `Backwards` and it doesn't make it better. So I'm not sure why it should make a difference – Yehosef May 29 '18 at 14:29
  • It probably doesn't make a difference, but any modification of the output is confusing. Without the "backwards", that would mean that the primary key index is ordered descending, which cannot be. – Laurenz Albe May 29 '18 at 14:33
  • I'm not sure if I follow - why would it scan the table instead of first filtering by the `by` field? Going back 50k records is not a big deal for a table of 17M. And, FWIW, if I add another where clause `and descendants>250` using `type='story'`, it still performs fast and uses the by index even though it has to go over 10M rows to get 20 results. – Yehosef May 29 '18 at 14:37
  • The issue is I don't want to use a hack - my hack also worked fine. I'm expecting postgres to be smart enough on a simple query like this to give the right results. I currently assuming that I've made a mistake somewhere, but I'm having a hard time proving that to be true. And I think the `Backwards` is a red herring. If you want me to reindex the database using id DESC and rerun the query I would. Like I said, running the query with `order by id asc` didn't print the backwards and wasn't any better. – Yehosef May 29 '18 at 14:42
  • You misunderstand me. Let's forget about the "backwards", it is unimportant. The problem is that PostgreSQL is not smart enough to know that there are so few matching rows in the "later" portion of the table, so the plan it chooses turns bad. You'll need a hack, or you have to make PostgreSQL smarter. But this will not be easy. You'd have to know, at least for the most frequent values, how they are physically distributed in the table, and I don't think that there is a reasonable way to store that avalanche of data in the table statistics. Sorry, but it has to be a hack. I tried to explain why. – Laurenz Albe May 29 '18 at 15:57
  • @LaurenzAlbe - I just added more info to the question: if I increase the limit it uses the right index. Does that fit with your understanding of the problem? How could I test your hypothesis? Do you know of any sources that talk about these kinds of problems arising from data distribution? It just seems broken. – Yehosef May 29 '18 at 21:01
  • Yes, that fits in with my hypothesis. If the limit is too high, PostgreSQL will know that it has to scan the index too long to be efficient. You can test my hypothesis by updating the 20 rows with the highest `id`s to match the condition, then it will be fast. Doesn't `Rows Removed by Filter: 46798` convince you? Of course it is somewhat broken because the estimates are bad. If you don't know how to improve it and don't want to work around it, try another database. – Laurenz Albe May 30 '18 at 02:45
  • Some more intel that LIMIT does affect planning: https://www.postgresql.org/docs/9.6/static/using-explain.html https://dba.stackexchange.com/questions/73614/postgresql-query-plan-differs-with-limit-value-making-the-query-slower-for-lower – weinerk May 30 '18 at 23:23
  • By the way, I think the meaning of the hack with concatenation is just preventing a specific index by adding a dynamic expression (in your case, preventing the "type" index): https://www.postgresql.org/docs/9.6/static/indexes-expressional.html – weinerk May 30 '18 at 23:29
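
For completeness, the extended-statistics idea from the comments would look something like this on PostgreSQL 10 or later (the statistics name is made up; as discussed above, it probably won't fix this particular case, since the issue is the physical distribution of the rows rather than the row-count estimate):

CREATE STATISTICS hn_items_by_type_stats (dependencies) ON by, type FROM hn_items;
ANALYZE hn_items;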