7

I have a table in PostgreSQL 9.2 that has a text column. Let's call this text_col. The values in this column are fairly unique (may contain 5-6 duplicates at the most). The table has ~5 million rows. About half these rows contain a null value for text_col. When I execute the following query I expect 1-5 rows. In most cases (>80%) I only expect 1 row.

Query

explain analyze SELECT col1,col2.. colN
FROM table 
WHERE text_col = 'my_value';

A btree index exists on text_col. This index is never used by the query planner and I am not sure why. This is the output of the query.

Planner

Seq Scan on two (cost=0.000..459573.080 rows=93 width=339) (actual time=1392.864..3196.283 rows=2 loops=1)
Filter: (victor = 'foxtrot'::text)
Rows Removed by Filter: 4077384

I added another partial index to try to filter out those values that were not null, but that did not help (with or without text_pattern_ops. I do not need text_pattern_ops considering no LIKE conditions are expressed in my queries, but they also match equality).

CREATE INDEX name_idx
  ON table
  USING btree
  (text_col COLLATE pg_catalog."default" text_pattern_ops)
  WHERE text_col IS NOT NULL;

Disabling sequence scans using set enable_seqscan = off; makes the planner still pick the seqscan over an index_scan. In summary...

  1. The number of rows returned by this query is small.
  2. Given that the non-null rows are fairly unique, an index scan over the text should be faster.
  3. Vacuuming and analyzing the table did not help the optimizer pick the index.

My questions

  1. Why does the database pick the sequence scan over the index scan?
  2. When a table has a text column whose equality condition should be checked, are there any best practices I can adhere to?
  3. How do I reduce the time taken for this query?

[Edit - More information]

  1. The index scan is picked up on my local database that houses about 10% of the data that is available in production.
Community
  • 1
  • 1
Deepak Bala
  • 11,095
  • 2
  • 38
  • 49
  • I think there is collation mismatch. Try to specify one: `SELECT col1,col2.. colN FROM table WHERE text_col = 'my_value' COLLATE pg_catalog."default"` – Rimas Sep 25 '14 at 04:35
  • I tried specifying `COLLATE pg_catalog."default"` and it resulted in a sequence scan. – Deepak Bala Sep 25 '14 at 04:48
  • 1
    Table structure (CREATE TABLE statement)? I a) built a test table, b) populated it with 5 million rows, half NULL and half random strings, c) created a partial index on text_col, d) analyzed the table, and e) found one matching row in 0.075 ms. PostgreSQL 9.3 used an index-only scan. – Mike Sherrill 'Cat Recall' Sep 25 '14 at 04:53
  • I use Postgresql 9.2.4. I find it equally puzzling that it would choose a sequence scan. – Deepak Bala Sep 25 '14 at 04:57
  • Since you have been experimenting with disabling planner methods. What do you get for `show enable_indexscan;`? You didn't disable that by accident? – Erwin Brandstetter Sep 25 '14 at 06:02
  • `show enable_indexscan;` returns `on` – Deepak Bala Sep 25 '14 at 06:03
  • [Cost settings? Statistics?](http://stackoverflow.com/questions/8228326/how-can-i-avoid-postgresql-sometimes-choosing-a-bad-query-plan-for-one-of-two-ne/8229000#8229000) Although, if even `set enable_seqscan = off;` didn't show any effect, that can't be it. – Erwin Brandstetter Sep 25 '14 at 06:13
  • Edit your question, and paste the CREATE TABLE statement for your table. I get the same results I got before (see my earlier comment) using PostgreSQL 9.2.9. Selecting a single value results in an index-only scan. Can anybody here reproduce the OP's behavior? – Mike Sherrill 'Cat Recall' Sep 25 '14 at 19:56
  • I've posted an answer that explains why the scan was not picked up. Thank you for taking the time to look into the problem. – Deepak Bala Sep 25 '14 at 20:35
  • I don't see how your answer explains anything. I got index-only scans despite having the "wrong" correlation, and despite not having `text_col is not null` in the WHERE clause. – Mike Sherrill 'Cat Recall' Sep 26 '14 at 01:05
  • My index scan is now picked up without the where condition for `text_col is not null` too. Both indexes are used separately with and without the where clause. My conclusion that ONLY `correlation` makes a difference to the index scan is annulled by your observation that your `correlation` was close to 0 (which goes against the documentation). Re-aligning the Fillfactor and vacuuming the table has helped albeit it's effect on why it forced the planner's hand is now unknown to me. I'll modify my answer to include your observation. – Deepak Bala Sep 26 '14 at 01:18

3 Answers3

13

A partial index is a good idea to exclude half the rows of the table which you obviously do not need. Simpler:

CREATE INDEX name_idx ON table (text_col)
WHERE text_col IS NOT NULL;

Be sure to run ANALYZE table after creating the index. (Autovacuum does that automatically after some time if you don't do it manually, but if you test right after creation, your test will fail.)

Then, to convince the query planner that a particular partial index can be used, repeat the WHERE condition in the query - even if it seems completely redundant:

SELECT col1,col2, .. colN
FROM   table 
WHERE  text_col = 'my_value'
AND   text_col IS NOT NULL;  -- repeat condition

Voilá.

Per documentation:

However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time. As a result, parameterized query clauses do not work with a partial index.

As for parameterized queries: again, add the (redundant) predicate of the partial index as an additional, constant WHERE condition, and it works just fine.


An important update in Postgres 9.6 largely improves chances for index-only scans (which can make queries cheaper and the query planner will more readily chose such query plans). Related:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • I tried this and ran the statements as suggested, replacing only the table / index and column names. It still picks up the sequence scan. I analyzed the table before running the query. – Deepak Bala Sep 25 '14 at 05:53
  • @DeepakBala: That's odd. I am using hundreds of partial indexes and they work as advertised. – Erwin Brandstetter Sep 25 '14 at 05:59
  • I'm stumped too. I decided to post on SO thinking I was missing something obvious. Let me know if there is something I could have overlooked. I'll post an answer if I get to the bottom of this sooner. – Deepak Bala Sep 25 '14 at 06:02
  • Thanks for your help. I managed to get the index scan picked up and have posted an answer of my own. I'll go ahead and accept your answer since you helped confirm that my approach with partial indexes was not off track. – Deepak Bala Sep 25 '14 at 20:33
  • The bottom of your pg9.6 link is exactly what I came here looking for: _"... the WHERE clause refers to success which is not available as a result column of the index. Nonetheless, an index-only scan is possible because the plan does not need to recheck that part of the WHERE clause at run time: all entries found in the index necessarily have success = true so this need not be explicitly checked in the plan. PostgreSQL versions 9.6 and later will recognize such cases and allow index-only scans to be generated, but older versions will not."_ – lpd Aug 03 '17 at 05:03
1

A partial index is only used if the WHERE conditions match. Thus an index with WHERE text_col IS NOT NULL can only be used if you use the same condition in your SELECT. Collation mismatch could also cause harm.

Try the following:

  1. Make a simplest possible btree index CREATE INDEX foo ON table (text_col)
  2. ANALYZE table
  3. Query
jkj
  • 2,561
  • 1
  • 17
  • 24
  • Tried that. It results in a sequence scan again. – Deepak Bala Sep 25 '14 at 04:52
  • @DeepakBala Tried in SQLfiddle and index seems to be used (http://sqlfiddle.com/#!15/b576c/2). Please `ANALYZE` the whole database once more. For some reason you index is not being used. Could you show `CREATE INDEX` for the new index? – jkj Sep 25 '14 at 05:07
  • I executed the create index exactly as you had suggested in #1. Only the name of the index / table and column were changed. Analyzed the entire database now and the sequence scan is still executed. The results seem awkward because the same table on my local database picks up the index correctly, albeit the size of the table is 10% of what it is in production. – Deepak Bala Sep 25 '14 at 05:17
1

I figured it out. Upon taking a closer look at the pg_stats view that analyze helps build, I came across this excerpt on the documentation.

Correlation

Statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to reduction of random access to the disk. (This column is null if the column data type does not have a < operator.)

On my local box the correlation number is 0.97 and on production it was 0.05. Thus the planner is estimating that it is easier to go through all those rows sequentially instead of looking up the index each time and diving into a random access on the disk block. This is the query I used to peek at the correlation number.

select * from pg_stats where tablename = 'table_name' and attname = 'text_col';

This table also has a few updates performed on its rows. The avg_width of the rows is estimated to be 20 bytes. If the update has a large value for a text column, it can exceed the average and also result in a slower update. My guess was that the physical and logical ordering are slowing moving apart with each update. To fix that I executed the following queries.

ALTER TABLE table_name SET (FILLFACTOR = 80);
VACUUM FULL table_name;
REINDEX TABLE table_name;
ANALYZE table_name;

The idea is that I could give each disk block a 20% buffer and vacuum full the table to reclaim lost space and maintain physical and logical order. After I did this the query picks up the index.

Query

explain analyze SELECT col1,col2... colN
FROM table_name 
WHERE text_col is not null 
AND 
text_col = 'my_value';

Partial index scan - 1.5ms

Index Scan using tango on two (cost=0.000..165.290 rows=40 width=339) (actual time=0.083..0.086 rows=1 loops=1)
Index Cond: ((victor five NOT NULL) AND (victor = 'delta'::text))

Excluding the NULL condition picks up the other index with a bitmap heap scan.

Full index - 0.08ms

Bitmap Heap Scan on two  (cost=5.380..392.150 rows=98 width=339) (actual time=0.038..0.039 rows=1 loops=1)
    Recheck Cond: (victor = 'delta'::text)
  ->  Bitmap Index Scan on tango  (cost=0.000..5.360 rows=98 width=0) (actual time=0.029..0.029 rows=1 loops=1)
          Index Cond: (victor = 'delta'::text)

[EDIT]

While it initially looked like correlation plays a major role in choosing the index scan @Mike has observed that a correlation value that is close to 0 on his database still resulted in an index scan. Changing fill factor and vacuuming fully has helped but I'm unsure why.

Deepak Bala
  • 11,095
  • 2
  • 38
  • 49
  • 1
    The correlation on my 9.2.9 table is .006; on my 9.3 table it's .003. Yet I got index-only scans on both of them. – Mike Sherrill 'Cat Recall' Sep 25 '14 at 23:13
  • On the example submitted by jkj the value was 0.4 when I last checked it. It could be that the planner only uses that number as one of parameters to decide whether it should go for an index scan. That said, executing those queries did change the planner's hand in my case and my conclusion that using `correlation` as the ONLY deciding factor could be a red herring. I stumbled across [cost estimation functions](http://www.postgresql.org/docs/9.2/static/index-cost-estimation.html) within psql that might help explain this behavior, but did not reach far down that road. – Deepak Bala Sep 26 '14 at 00:19