
I have a very simple DB schema with a multi-column B-tree index on the following columns:

PersonId, Amount, Commission

Now, if I select from the table with the following query:

explain select * from "Order" where "PersonId" = 2 AND "Commission" > 3

PostgreSQL scans the index and the query is very fast, but if I try the following query:

explain select * from "Order" where "PersonId" > 2 AND "Commission" > 3

It does a sequential scan, even though the index is present. Even this query

explain select * from "Order" where "Commission" > 3 

does a sequential scan. Anyone care to explain why? :-)

Thank you very much.

UPDATE

The table contains 100 million rows. I created it just to test PostgreSQL performance against MS SQL. The table has already been VACUUMed. I'm running a Core i5 2500K quad-core CPU and 8 GB of RAM.

Here's the result of explain analyze for this query:

explain ANALYZE select * from "Order" where "Commission" BETWEEN 3000000 AND 3000010  LIMIT 20 


Limit  (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.249..28043.249 rows=0 loops=1)
  ->  Seq Scan on "Order"  (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.247..28043.247 rows=0 loops=1)
        Filter: (("Commission" >= 3000000::numeric) AND ("Commission" <= 3000010::numeric))
Total runtime: 28043.278 ms
Davita
  • Look there: http://stackoverflow.com/questions/5203755/why-does-postgresql-perform-sequential-scan-on-indexed-column, especially @Frank Heikens's answer. – Raphaël Althaus May 13 '12 at 19:11
  • Because the index column order matters. Read this: http://use-the-index-luke.com/sql/where-clause/searching-for-ranges/greater-less-between-tuning-sql-access-filter-predicates – Markus Winand May 13 '12 at 20:06
  • Can you show us the `CREATE TABLE` and `CREATE INDEX` statements for the table? The statements you used to populate the table would be fantastic to have, too, if practicable. (`\d "Order"` output from psql would *do*, but it makes it harder for people to duplicate your results and test suggested changes.) I'm assuming the data is fully cached, because otherwise it would be awfully hard to pass 100 million rows in 28 seconds. The default configuration assumes minimal caching, so a plan based on the assumption of disk access will be chosen, which weights against random index access. – kgrittn May 13 '12 at 20:40
  • Also, if you want better performance, one thing you could do would be to use `int` or `bigint` instead of `numeric`. I've seen that make as much as a 5% difference in performance. http://www.postgresql.org/docs/current/interactive/datatype-numeric.html If you don't need the fractional scale available with `numeric`, the maximum of the integer-based types (9223372036854775807) is likely to be adequate for most purposes. – kgrittn May 13 '12 at 20:47
  • @kgrittn I generated the data first in SQL Server using Red Gate's SQL Data Generator, then I transferred the data to PostgreSQL using Navicat Premium. I also used the Navicat management tool to create the indexes, which is why I don't have any DDL script at hand :( – Davita May 13 '12 at 21:33
  • In your update, your query predicate is a range of commissions, but commission is the lowest-order column of your multi-column index, so the index will be of little help in selecting the requested rows. You want either an index on commission alone, or a multi-column index with commission as the first column in the key, to usefully aid this query with an index scan. – dbenhur May 13 '12 at 23:42
  • Did you try adding one more index on the Commission field only? –  Jul 10 '18 at 20:44

1 Answer


The short answer is that when comparing the various available plans, the sequential scan is expected to be the fastest, based on the costing factors you have configured and the latest statistics available. From what little information you've provided, it seems quite likely that the planner has made the right choice. If you had three single-column indexes, it might be able to use bitmap index scans, particularly if the rows to be selected are less than about 10% of the rows in the table.
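
As a sketch of what that could look like (the index names are assumptions, not taken from the question):

CREATE INDEX order_personid_idx   ON "Order" ("PersonId");
CREATE INDEX order_commission_idx ON "Order" ("Commission");

-- With selective enough predicates, the planner can AND the two bitmaps
-- together and visit each matching heap page only once:
EXPLAIN SELECT * FROM "Order" WHERE "PersonId" > 2 AND "Commission" > 3;

Whether a bitmap plan is actually chosen still depends on the estimated selectivity; if the predicates match most of the table, the sequential scan stays cheaper.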

Note that with the index you describe, the entire index would need to be scanned for all rows where "PersonId" > 2; unless you have a lot of negative values for "PersonId", that is very likely to be most of the table.
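
A sketch of the leading-column rule with the ("PersonId", "Amount", "Commission") index:

-- Can descend the B-tree: equality on the leading column narrows the scan
SELECT * FROM "Order" WHERE "PersonId" = 2 AND "Commission" > 3;

-- Cannot: with no useful restriction on "PersonId", every index entry
-- would have to be checked against "Commission", so a seq scan wins
SELECT * FROM "Order" WHERE "Commission" > 3;

An index led by "Commission" (such as the order_commission_idx sketched above) would serve the range predicate in your update directly.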

Also note that if you have a tiny table (say, a few thousand rows or fewer), accessing the rows through an index will rarely be faster than just scanning those few rows. Plans are sensitive to data volume, and the plan you get with a small number of rows is very unlikely to be the same plan you get with a lot of rows.

If it is, in fact, not picking the fastest plan, the odds are good that you need to adjust your cost factors to better model the costs on your machine. Another possibility is that you need to be more aggressive in your autovacuum settings, to make sure up-to-date statistics are available, or you may need to configure collection of finer-grained statistics.
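
As a sketch of the kind of tuning meant here (the values are illustrative assumptions, not recommendations for your machine):

-- postgresql.conf, for a mostly-cached data set:
--   random_page_cost = 1.5       (the default 4.0 models uncached disk access)
--   effective_cache_size = 6GB   (roughly what shared_buffers plus the OS can cache)

-- Refresh planner statistics by hand:
ANALYZE "Order";

-- Collect finer-grained statistics for one column (the target is an example):
ALTER TABLE "Order" ALTER COLUMN "Commission" SET STATISTICS 1000;
ANALYZE "Order";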

People will be able to provide more specific advice if you show the table descriptions (including indexes), the EXPLAIN ANALYZE output for the query, and a description of the hardware.
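
For example, from psql (the BETWEEN query is the one from your update):

\d "Order"

EXPLAIN ANALYZE SELECT * FROM "Order" WHERE "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20;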

kgrittn