
Assuming that a table is large enough to warrant an index seek, at what cardinality will SQL Server (or PostgreSQL) opt for an index scan instead?

The reason I ask is that I previously posted a question (link) in which two queries performed at the same speed, yet one didn't attempt to use the index on the columns involved. After SQL Server suggested I add a covering index that included the columns being queried (it suggested this for both queries), I started looking for reasons why it would make such a strange suggestion.

I experimented with making the indexes both covering and composite, but both versions executed in the same time (we're talking 3 million rows).

Finally, I concluded it was because of the ultra-high cardinality of the data: every row is unique. I deduce this caused SQL Server to choose an index scan. However, the query's WHERE clause was "WHERE Col1 > ? AND Col2 < ?", so this is a little confusing.

My questions are:

  1. At what cardinality will an RDBMS always opt for an index scan?
  2. Can anyone explain why SQL Server would not use the index when the WHERE clause indicates it would make sense?

I have attached the execution plan.

IamIC

2 Answers


In SQL Server terms, this is referred to as the tipping point; Kimberly Tripp's blog series is a good read on it: http://www.sqlskills.com/BLOGS/KIMBERLY/category/The-Tipping-Point.aspx

The tipping point is a guideline of 25%-33% of the total number of pages in the table, expressed as rows: e.g. 10k data pages would give a tipping point of 2,500-3,333 rows. As guidelines go this is pretty good, and as good as you will get; remember the query optimizer is a black box, and while it will give you a query plan, it only tells you what it decided, not why.
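If you want to see where that guideline lands for your own table, the page counts are easy to pull from the DMVs. A minimal sketch (the table name is illustrative, not from the question):

    -- Estimate the tipping point from the leaf-level data page count:
    -- roughly 25%-33% of the table's data pages, expressed as a row count.
    SELECT  ps.row_count,
            ps.in_row_data_page_count               AS data_pages,
            ps.in_row_data_page_count * 0.25        AS tipping_point_low_rows,
            ps.in_row_data_page_count * 0.33        AS tipping_point_high_rows
    FROM    sys.dm_db_partition_stats AS ps
    WHERE   ps.object_id = OBJECT_ID('dbo.MyTable') -- illustrative table name
      AND   ps.index_id IN (0, 1);                  -- heap or clustered index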

Tipping a covering index, though, is not actually very easy: even with 100% of the data being selected, a covering index will still seek rather than scan in the majority of cases.

That makes sense if you consider that the cost-based optimizer doesn't assign any real cost to the index page hierarchy, and only costs the access to the leaf pages of the index. At that point, scanning or seeking 100% of a covering index is costed the same.
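To make the covering-versus-composite distinction concrete, here is the shape of the two alternatives, reusing the question's Col1/Col2 as stand-ins (table and index names are assumptions):

    -- Covering via INCLUDE: only Col1 is a key column; Col2 is carried
    -- at the leaf level, so the query never has to visit the base table.
    CREATE NONCLUSTERED INDEX IX_T_Col1_incl_Col2
        ON dbo.T (Col1)
        INCLUDE (Col2);

    -- Composite alternative: both columns are key columns, so the index
    -- is ordered by Col1 then Col2 and both can be seeked on.
    CREATE NONCLUSTERED INDEX IX_T_Col1_Col2
        ON dbo.T (Col1, Col2);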

I found from my own experimentation (http://sqlfascination.com/2009/11/07/can-a-covering-nc-index-be-tipped) that using a BETWEEN clause would cause it to scan, but other WHERE clauses would not; from what I could tell, it was purely down to the route taken through the query engine.
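For illustration, the contrast from those tests looked roughly like this (table, columns, and parameters are placeholders):

    -- A BETWEEN spanning essentially all of the rows tended to tip
    -- the covering index into a scan:
    SELECT Col1, Col2
    FROM   dbo.T
    WHERE  Col1 BETWEEN @low AND @high;

    -- The logically equivalent pair of inequalities tended to keep seeking:
    SELECT Col1, Col2
    FROM   dbo.T
    WHERE  Col1 >= @low AND Col1 <= @high;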

Andrew
  • Great answer @Andrew. That clears it up for me nicely, and explains why SQL Server chose to scan the index. – IamIC Jan 02 '11 at 17:23
  • @Andrew: "Tipping a covering index, though, is not actually very easy: even with 100% of the data being selected, a covering index will still seek rather than scan in the majority of cases" - why is this? – IamIC Jan 02 '11 at 17:34
  • The query plan engine is a cost-based optimizer; given that access to the index hierarchy is costed as 0, seeking every leaf page in the index is the same cost as scanning every leaf page in the index (in cost terms). Depending on the WHERE clause used I have seen it do both, but it took considerable effort to get it to scan; the default was seek. – Andrew Jan 02 '11 at 17:37
  • @Andrew Ok. Then how does one know when it's best to use a covering index constructed as join_column INCLUDE (comparison columns 1 - n), vs. a composite index constructed as (join_column, comparison columns 1 - n), given that the latter is prone to tipping but has the advantage of key columns on the comparison columns? – IamIC Jan 02 '11 at 17:49
  • If I knew I was using a column in some query criteria, I would put it in the keyed section, not the INCLUDE section. But really, if the index is covering, it is very unlikely to tip; only a BETWEEN-style clause seemed to do it for me, when selecting 100% of the rows. The rest of the time, a covering index would not tip. – Andrew Jan 02 '11 at 18:05
  • Thanks. I guess the only way to really see which would be optimal is to test both with real data. Obviously tipping comes at a performance cost, as does not using an index on a field being compared. I'm midway through your article... you mention BETWEEN causing a tip. What about IN()? – IamIC Jan 02 '11 at 18:10
  • From memory, it did not tip it; an IN clause gets converted to a set of OR clauses. – Andrew Jan 02 '11 at 18:14
  • So anything equating to equality seems to prevent tipping, and anything relating to ranges (>, <, BETWEEN) seems to allow tipping. Does that seem like an accurate summation? – IamIC Jan 02 '11 at 18:24
  • If the range ended up being 100% of the data, yes; at that point seek vs. scan was pretty immaterial. If I was using a BETWEEN for a smaller amount of data, it did not tip. – Andrew Jan 02 '11 at 18:25

In PostgreSQL, this is usually not a good question to ask, because the actual plan selection is more complicated. It depends on table size, memory settings, and other parts of the query. You will usually get a plain index scan only if you are selecting very few rows. Above that, you will get a bitmap index scan, up to around 40% selectivity in simple experiments.
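If you want to see those transitions yourself, the simplest approach is to vary the selectivity of a predicate and check the plan with EXPLAIN. A sketch against a made-up table t with an index on col and around a million rows:

    EXPLAIN SELECT * FROM t WHERE col > 990000; -- few rows: Index Scan
    EXPLAIN SELECT * FROM t WHERE col > 600000; -- more rows: Bitmap Index/Heap Scan
    EXPLAIN SELECT * FROM t WHERE col > 100000; -- most rows: Seq Scan

The exact crossover points depend on your cost settings and table statistics.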

Peter Eisentraut
  • Thanks @Peter. You mention Bitmap indexes (a descendant from M/Caché). Under what conditions are those used? (low cardinality I'm guessing) – IamIC Jan 02 '11 at 18:40
  • Ps. I'm new to PostgreSQL, but experienced with SQL Server. – IamIC Jan 02 '11 at 18:47
  • A bitmap index scan does not use a bitmap index (which doesn't exist in PostgreSQL). It is a kind of index scan that happens to use some bitmaps along the way. As I wrote above, they are used somewhere between regular index scans and sequential scans. – Peter Eisentraut Jan 02 '11 at 19:04