169

How do I force Postgres to use an index when it would otherwise insist on doing a sequential scan?

OMG Ponies
mike
  • Duplicated, see http://stackoverflow.com/questions/14554302/postgres-query-optimization-forcing-an-index-scan – Grigory Kislin Apr 15 '17 at 12:10
  • 1
    +1 I would love to see this feature. It's not a matter of simply disabling seq scan, as other answers say: we need the ability to force PG to use a **specific index**. This is because in the real world stats can be [completely wrong](https://dba.stackexchange.com/q/260817/66756) and at that point you need to use unreliable / partial workarounds. I agree that in simple cases you should first check the indexes and other settings, but for reliability and advanced uses on big data we need this. – collimarco Feb 28 '20 at 19:17
  • 1
    MySQL and Oracle both have it... Not sure why Postgres' planner is so unreliable. – Kevin Parker Mar 12 '20 at 23:27

9 Answers

141

Assuming you're asking about the common "index hinting" feature found in many databases, PostgreSQL doesn't provide such a feature. This was a conscious decision made by the PostgreSQL team. A good overview of why and what you can do instead can be found here. The reasons are basically that it's a performance hack that tends to cause more problems later down the line as your data changes, whereas PostgreSQL's optimizer can re-evaluate the plan based on the statistics. In other words, what might be a good query plan today probably won't be a good query plan for all time, and index hints force a particular query plan for all time.

As a very blunt hammer, useful for testing, you can use the enable_seqscan and enable_indexscan parameters. See the planner method configuration documentation.

These are not suitable for ongoing production use. If you have issues with query plan choice, you should see the documentation for tracking down query performance issues. Don't just set enable_ params and walk away.
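
For example, a quick way to compare plans in a test session is to disable sequential scans only for the current transaction; the table and column names below are hypothetical:

BEGIN;
SET LOCAL enable_seqscan = off;          -- discourage seq scans for this transaction only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;                                -- the setting reverts when the transaction ends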

Unless you have a very good reason for using the index, Postgres may be making the correct choice. Why?

  • For small tables, it's faster to do sequential scans.
  • Postgres doesn't use indexes when datatypes don't match properly; you may need to include appropriate casts.
  • Your planner settings might be causing problems.

See also this old newsgroup post.

jpmc26
Patryk Kordylewski
  • 4
    Agreed. Forcing Postgres to do it your way usually means you've done it wrong. 9/10 times the planner will beat anything you can come up with. The other 1 time, it's because you got it wrong. – Kent Fredric Nov 21 '08 at 19:31
  • I think it is a good idea for checking whether the operator classes of your index really hold. – metdos Sep 13 '12 at 06:55
  • 4
    I hate to revive an old question, but I often see this in Postgres documentation, discussions, and here: is there a generalized notion of what qualifies as a *small table*? Is it something like 5000 rows, or 50000, etc.? – waffl Jul 22 '14 at 08:46
  • 2
    @waffl Have you considered benchmarking? Create a simple table with an index and an accompanying function for filling it up with *n* rows of random junk. Then start looking at the query plan for different values of *n*. When you see it start using the index, you should have a ballpark answer. You can also get sequential scans if PostgreSQL determines (based on statistics) that an index scan isn't going to eliminate very many rows, too. So benchmarking is always a good idea when you have real performance concerns. As an off-hand, anecdotal guess, I'd say a couple thousand is usually "small." – jpmc26 Sep 09 '14 at 05:47
  • 49
    With over 30 years of experience on platforms such as Oracle, Teradata and MSSQL, I find the optimizer of PostgreSQL 10 not especially smart. Even with up-to-date statistics it generates less efficient execution plans than when forced in a particular direction. Providing structural hints to compensate for these issues would provide a solution to allow PostgreSQL to grow in more market segments. IMHO. – Guido Leenders Oct 19 '18 at 12:39
115

Probably the only valid reason for using

set enable_seqscan=false

is when you're writing queries and want to quickly see what the query plan would actually be were there large amounts of data in the table(s). Or of course if you need to quickly confirm that your query is not using an index simply because the dataset is too small.

Niraj Bhawnani
  • 51
    this short reply actually gives a good hint for testing purposes – dwery Mar 03 '14 at 18:26
  • 4
    No one is answering the question! – Ivailo Bardarov Apr 25 '14 at 09:39
  • 1
    @IvailoBardarov The reason all these other suggestions are here is because PostgreSQL doesn't have this feature; this was a conscious decision made by the developers based on how it's typically used and the long term problems it causes. – jpmc26 Sep 09 '14 at 05:50
  • A nice trick to test: run `set enable_seqscan=false`, run your query, and then quickly run `set enable_seqscan=true` to return postgresql to its proper behaviour (and obviously don't do this in production, only in development!) – Brian Hellekin Feb 23 '18 at 16:20
  • 11
    @BrianHellekin Better, `SET SESSION enable_seqscan=false` to only affect yourself – Izkata Nov 14 '19 at 16:28
  • 2
    SESSION is the default so it's equivalent to set enable_seqscan=false – Pascal Heraud Dec 18 '20 at 14:38
46

TL;DR

Run the following three commands and check whether the problem is fixed:

ANALYZE;
SET random_page_cost = 1.0;
SET effective_cache_size = 'X GB';    -- replace X with total RAM size minus 2 GB

Read on for further details and background information about this.

Step 1: Analyze tables

As a simple first attempt to fix the issue, run the ANALYZE; command as the database superuser in order to update all table statistics. From the documentation:

The query planner uses these statistics to help determine the most efficient execution plans for queries.
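
If you only want to refresh the statistics of a single table, you can also analyze it individually (schema and table name below are placeholders):

ANALYZE my_schema.my_table;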

Step 2: Set the correct random page cost

Index scans require non-sequential disk page fetches. PostgreSQL uses the random_page_cost configuration parameter to estimate the cost of such non-sequential fetches in relation to sequential fetches. From the documentation:

Reducing this value [...] will cause the system to prefer index scans; raising it will make index scans look relatively more expensive.

The default value is 4.0, thus assuming an average cost factor of 4 compared to sequential fetches, taking caching effects into account. However, if your database is stored on an SSD drive, then you should actually set random_page_cost to 1.1 according to the documentation:

Storage that has a low random read cost relative to sequential, e.g., solid-state drives, might also be better modeled with a lower value for random_page_cost, e.g., 1.1.

Also, if an index is mostly (or even entirely) cached in RAM, then an index scan will always be significantly faster than a disk-served sequential scan. The query planner however doesn't know which parts of the index are already cached, and thus might make an incorrect decision.

If your database indices are frequently used, and if the system has sufficient RAM, then the indices are likely to be cached eventually. In such a case, random_page_cost can be set to 1.0, or even to a value below 1.0 to aggressively prefer using index scans (although the documentation advises against doing that). You'll have to experiment with different values and see what works for you.

As a side note, you could also consider using the pg_prewarm extension to explicitly cache your indices into RAM.
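
For example, a minimal sketch of prewarming a single index with pg_prewarm (the index name is a placeholder):

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('my_table_date_idx');   -- loads the index pages into shared buffers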

You can set the random_page_cost like this:

SET random_page_cost = 1.0;

Step 3: Set the correct cache size

On a system with 8 or more GB RAM, you should set the effective_cache_size configuration parameter to the amount of memory which is typically available to PostgreSQL for data caching. From the documentation:

A higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.

Note that this parameter doesn't change the amount of memory which PostgreSQL will actually allocate, but is only used to compute cost estimates. A reasonable value (on a dedicated database server, at least) is the total RAM size minus 2 GB. The default value is 4 GB.

You can set the effective_cache_size like this:

SET effective_cache_size = '14 GB';   -- e.g. on a dedicated server with 16 GB RAM

Step 4: Fix the problem permanently

You probably want to use ALTER SYSTEM SET ... or ALTER DATABASE db_name SET ... to set the new configuration parameter values permanently (either globally or per-database). See the documentation for details about setting parameters.
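
For example (a sketch; the database name and values are placeholders to adapt to your system):

ALTER SYSTEM SET random_page_cost = 1.1;
ALTER DATABASE my_database SET effective_cache_size = '14GB';
SELECT pg_reload_conf();   -- makes the ALTER SYSTEM change take effect without a restart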

Step 5: Additional resources

If it still doesn't work, then you might also want to take a look at this PostgreSQL Wiki page about server tuning.

emkey08
  • 5
    I even had to set random_page_cost = 0.1 in order to make an index scan work on a large (~600M rows) table in Pg 10.1 on Ubuntu. Without the tweak, the seq scan (despite being parallel) was taking 12 mins (note that the table had been analyzed!). Drive is SSD. After the tweak, exec time became 1 second. – Anatoly Alekseev Jan 12 '19 at 13:02
  • 2
    You saved my day. I was going crazy trying to figure out how the exact same query on the same database was taking 30 seconds on one machine and less than 1 on another, even after running ANALYZE on both ends... To whom it may concern: the command 'ALTER SYSTEM SET random_page_cost=x' sets the new default value globally. – Julien Jul 14 '20 at 20:30
27

Sometimes PostgreSQL fails to make the best choice of indexes for a particular condition. As an example, suppose there is a transactions table with several million rows, of which there are several hundred for any given day, and the table has four indexes: transaction_id, client_id, date, and description. You want to run the following query:

SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
      description = 'Refund'
GROUP BY client_id

PostgreSQL may choose to use the index transactions_description_idx instead of transactions_date_idx, which may lead to the query taking several minutes instead of less than one second. If this is the case, you can force using the index on date by fudging the condition like this:

SELECT client_id, SUM(amount)
FROM transactions
WHERE date >= 'yesterday'::timestamp AND date < 'today'::timestamp AND
      description||'' = 'Refund'
GROUP BY client_id

Ezequiel Tolnay
  • 3
    Nice idea. However, when we disable usage of the current index with this method, the PostgreSQL query optimizer falls back to the next suitable index. Thus, there is no guarantee that the optimizer will choose `your_wanted_index`; it may be that the PostgreSQL engine will just perform a sequential / primary key scan instead. Conclusion - there is no 100% reliable method to force usage of some index in PostgreSQL. – Agnius Vasiliauskas May 17 '18 at 06:42
  • What if there is no `where` condition but two tables are joined, and Postgres fails to use the index? – Surya Dec 02 '19 at 11:56
  • @Surya the above applies to both WHERE and to JOIN ... ON conditions – Ezequiel Tolnay Dec 03 '19 at 21:54
12

The question in itself is very much invalid. Forcing (by setting enable_seqscan=off, for example) is a very bad idea. It might be useful to check whether it would be faster, but production code should never use such tricks.

Instead - do EXPLAIN ANALYZE of your query, read it, and find out why PostgreSQL chooses a bad (in your opinion) plan.
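
For example (hypothetical table and column):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM my_table WHERE some_column = 'some value';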

There are tools on the web that help with reading explain analyze output - one of them is explain.depesz.com - written by me.

Another option is to join the #postgresql channel on the freenode IRC network and talk to the people there to help you out - optimizing a query is not a matter of "ask a question, get an answer, be happy". It's more like a conversation, with many things to check and many things to learn.

10

One thing to try with PostgreSQL, when you expect an index to be used and it is not being used, is to VACUUM ANALYZE the table.

VACUUM ANALYZE schema.table;

This updates the statistics used by the planner to determine the most efficient way to execute a query, which may result in the index being used.

Another thing to check is the types.

Is the index on an int8 column and you are querying with numeric? The query will work but the index will not be used.
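
A minimal sketch of that pitfall (hypothetical table; id is an int8/bigint column with an index):

EXPLAIN SELECT * FROM events WHERE id = 42::numeric;   -- the column must be cast to numeric, so the index on id cannot be used
EXPLAIN SELECT * FROM events WHERE id = 42::bigint;    -- types match, so an index scan is possible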

Vesanto
4

There is a trick to push Postgres to prefer a seq scan: add an OFFSET 0 to the subquery.

This is handy for optimizing queries joining big/huge tables when all you need is only the first/last n elements.

Let's say you are looking for the first/last 20 elements, joining multiple tables with 100k (or more) entries; there is no point building up the whole query over all the data when what you are looking for is within the first 100 or 1000 entries. In this scenario, for example, it turns out to be over 10x faster to do a sequential scan.
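
As an illustration, a sketch of the OFFSET 0 fence (tables and columns are hypothetical):

SELECT o.*, i.product_id
FROM (
    SELECT *
    FROM orders
    WHERE status = 'pending'
    OFFSET 0            -- fence: keeps the planner from flattening this subquery into the outer query
) o
JOIN order_items i ON i.order_id = o.id
LIMIT 20;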

See How can I prevent Postgres from inlining a subquery?

the_spectator
Antony Gibbs
1

Indexes can only be used under certain circumstances.

  1. For example, the type of the value fits the type of the column.
  2. You are not performing an operation on the column before comparing it to the value.

Given a customer table with 3 columns and an index on each of the columns:

create table customer(id numeric(10), age int, phone varchar(200))

It might happen that the database tries to use, for example, the index idx_age instead of the index on the phone number.

You can sabotage the usage of the index on age by performing an operation on age:

 select * from customer where phone = '1235' and age+1 = 24 

(although you are looking for the age 23)

This is of course a very simple example, and the intelligence of Postgres is probably good enough to make the right choice. But sometimes there is no other way than tricking the system.

Another example is to

select * from customer where phone = '1235' and age::varchar = '23'

But this is probably more costly than the option above.

Unfortunately you CANNOT specify the name of the index in the query like you can in MSSQL or Sybase:

select * from customer (index idx_phone) where phone = '1235' and age = 23.

This would help a lot to avoid problems like this.

Hasan Tuncay
0

Apparently there are cases where Postgres can be hinted into using an index by repeating a similar condition twice.

The specific case I observed was using a PostGIS gin index and the ST_Within predicate, like this:

select *
from address
natural join city
natural join restaurant
where st_within(address.location, restaurant.delivery_area)
and restaurant.delivery_area ~ address.location

Note that the first predicate st_within(address.location, restaurant.delivery_area) is automatically decomposed by PostGIS into (restaurant.delivery_area ~ address.location) AND _st_contains(restaurant.delivery_area, address.location), so adding the second predicate restaurant.delivery_area ~ address.location is completely redundant. Nevertheless, the second predicate convinced the planner to use the spatial index on address.location and, in the specific case I needed, improved the running time by a factor of 8.

emu