237

I've heard that you should put columns that will be the most selective at the beginning of the index declaration. Example:

CREATE NONCLUSTERED INDEX MyINDX on Table1
(
   MostSelective,
   SecondMost,
   Least
)

First off, is what I'm saying correct? If so, am i likely to see large differences in performance by rearranging the order of the columns in my index or is it more of a "nice to do" practice?

The reason I'm asking is because after putting a query through the DTA it recommended that I create an index that had almost all of the same columns in it as an existing index, just in a different order. I was considering just adding the missing columns to the existing index and calling it good. Thoughts?

Abe Miessler
  • 82,532
  • 99
  • 305
  • 486

5 Answers5

243

Look at an index like this:

Cols
  1   2   3
-------------
|   | 1 |   |
| A |---|   |
|   | 2 |   |
|---|---|   |
|   |   |   |
|   | 1 | 9 |
| B |   |   |
|   |---|   |
|   | 2 |   |
|   |---|   |
|   | 3 |   |
|---|---|   |

See how restricting on A first, as your first column eliminates more results than restricting on your second column first? It's easier if you picture how the index must be traversed across, column 1, then column 2, etc...you see that lopping off most of the results in the fist pass makes the 2nd step that much faster.

Another case, if you queried on column 3, the optimizer wouldn't even use the index, because it's not helpful at all in narrowing down the result sets. Anytime you're in a query, narrowing down the number of results to deal with before the next step means better performance.

Since the index is also stored this way, there's no backtracking across the index to find the first column when you're querying on it.

In short: No, it's not for show, there are real performance benefits.

Nick Craver
  • 623,446
  • 136
  • 1,297
  • 1,155
  • 23
    In the picture above, keep in mind that that index would only be beneficial if Column 1 was specified in the query. If your query only specifies Column 2 in the Join or Search Predicate then it would not be beneficial. So order matters there also. Maybe that goes without saying, but wanted to mention it. – CodeCowboyOrg Mar 23 '15 at 16:15
  • 8
    Also keep in mind, suppose your Index is like the picture above, and your query filters on column1 and column2, but column2 is more unique and what you really want to filter on is actually column2, then its more beneficial to just have a index where column 2 is first. This may seem counterintuitive but keep in mind an index is stored on several pages and is a tree with a range of values, while Column 1 above does negate 1/2 the possibilities, the index already knows which index page to go straight to for the Column2 value, it does not necessary need Column 1 to narrow down the set. – CodeCowboyOrg Mar 23 '15 at 17:57
  • 6
    This picture is not an accurate representation of how indexes are structured or navigated. Have submitted an answer rectifying this http://stackoverflow.com/a/39080819/73226 – Martin Smith Aug 22 '16 at 16:05
  • 7
    @MartinSmith I disagree that it is inaccurate. It's is very admittedly *extremely* simplified, which was my intent. Your answer digging into far more detail about the levels is appreciated though, for those that want to dig deeper into it. If you look at your tree image, you'll see what I am illustrating in a *very* simple way. This isn't very unique or even SQL specific; B-tree indexing is pretty common across so many things. – Nick Craver Aug 23 '16 at 10:11
  • @MartinSmith I'd also disagree that it is inaccurate, what you are describing is the standard behaviour of how to arrive at covering index - selectivity is much more important once you are performing range queries as this minimises the number of index pages that the optimiser must scan; this can be significant in large tables with millions of rows – Paul Hatcher Oct 28 '16 at 14:58
  • @PaulHatcher - if you need an index that supports a range seek selectivity is irrelevant you need the leading column to be the one your range is against. Nothing else will give you a range seek so there is no choice to be had. – Martin Smith Oct 28 '16 at 17:51
  • @NickCraver Can we say that using a filter with column 1: index seek, column 2: index scan, column 3: table scan? – yakya May 30 '17 at 12:00
  • 3
    Is having multiple indexes that are ordered differently beneficial? For example A,B,C and B,A,C to help with the different grouping possibilities? – jwrightmail Oct 22 '18 at 17:56
  • 1
    There is no "next step". Think of it this way: All the columns are concatenated together, then the combined search is used against an index with combined columns. – Rick James Aug 26 '20 at 18:41
  • But you are relying on the misleading picture painted by the "extreme simplification" to imply there is some particular benefit to having the most selective column first, there is no such benefit so what is the reason for this answer to exist? – Martin Smith Jun 13 '23 at 18:27
184

The order of columns is critical. Now which order is correct it depends on how you are going to query it. An index can be used to do an exact seek or an range scan. An exact seek is when values for all columns in the index are specified and the query lands exactly on the row is interested in. For seeks the order of columns is irrelevant. A range scan is when only some columns are specified, and in this case when the order becomes important. SQL Server can use an index for a range scan only if the leftmost column is specified, and then only if the next leftmost column is specified, and so on. If you have an index on (A,B,C) it can be used to range scan for A=@a, for A=@a AND B=@b but not for B=@b, for C=@c norB=@b AND C=@c. The case A=@a AND C=@c is mixed one, as in the A=@a portion will use the index, but the C=@c not (the query will scan all B values for A=@a, will not 'skip' to C=@c). Other database systems have the so called 'skip scan' operator that can take some advantage of inner columns in an index when the outer columns are not specified.

With that knowledge in hand you can look at the index definitions again. An index on (MostSelective, SecondMost, Least) will be effective only when MostSelective column is specified. But that being the most selective, the relevance of the inner columns will quickly degrade. Very often you'll find that a better index is on (MostSelective) include (SecondMost, Least) or on (MostSelective, SecondMost) include (Least). Because the inner columns are less relevant, placing low selectivity columns in such right positions in the index makes them nothing but noise for a seek, so it makes sense to move them out of the intermediate pages and keep them only on the leaf pages, for query coverability purposes. In other words, move them to INCLUDE. This becomes more important as the size of Least column increases. The idea is that this index can only benefit queries that specify MostSelective either as an exact value or a range, and that column being the most selective it already restricts the candidate rows to great extent.

On the other hand an index on (Least, SecondMost, MostSelective) may seem a mistake, but it actually quite a powerful index. Because it has the Least column as its outermost query, it can be used for queries that have to aggregate results on low selectivity columns. Such queries are prevalent in OLAP and analysis data warehouses, and this is exactly where such indexes have a very good case going for them. Such indexes actually make excellent clustered indexes, exactly because they organize the physical layout on large chunks of related rows (same Least value, which usually indicate some sort of category or type) and they facilitate analysis queries.

So, unfortunately, there is no 'correct' order. You shouldn't follow any cookie cutter recipe but instead analyze the query pattern you are going to use against those tables and decide which index column order is right.

Kapol
  • 6,383
  • 3
  • 21
  • 46
Remus Rusanu
  • 288,378
  • 40
  • 442
  • 569
  • Is this explanation applicable for Oracle DB? – another Jan 18 '17 at 16:43
  • 1
    @Roizpi Yes it is, basically any relation database with Indexes is working same or very similar way. – Tatranskymedved Jun 15 '17 at 13:40
  • Relative to performance, if I anticipate queries for every combination of the three fields would it better to set up three indexes as follows: 1) `on (MostSelective) include (SecondMost, Least)` 2) `on (SecondSelective) include (MostSelective, Least)` 3) `on (Least) include (MostSelective, SecondMost)` Or 6 indexes as follows: 1) `on (MostSelective, SecondMost, Least)` 2) `on (Least, SecondMost, MostSelective)` 3) `on(SecondMost, Least, MostSelective)` 4) `on(SecondMost, MostSelective, Least)` 5) `on(MostSelective, Least, SecondMost)` 6) `on(Least, MostSelective, SecondMost)` – rdg515 Aug 05 '20 at 00:20
  • In the example above `If you have an index on (A,B,C) it can be used to range scan for A=@a, for A=@a AND B=@b`. How about in WHERE statement, I do `B=@b AND A=@a`. Will that matter? Can Index work for `A=@a AND B=@b` as well as `B=@b AND A=@a`? – Sam Mar 30 '21 at 00:59
  • 2
    @Sam the expressions `A=@a AND B=@b` vs. `B=@b AND A=@a` and basically indistinguishably during execution, so the difference does not matter. – Remus Rusanu Apr 01 '21 at 13:42
73

As Remus says it depends on your workload.

I want to address a misleading aspect of the accepted answer though.

For queries that are performing an equality search on all columns in the index there is no significant difference.

The below creates two tables and populates them with identical data. The only difference is that one has the keys ordered from most to least selective and the other the reverse.

CREATE TABLE Table1(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);
CREATE TABLE Table2(MostSelective char(800), SecondMost TINYINT, Least  CHAR(1), Filler CHAR(4000) null);

CREATE NONCLUSTERED INDEX MyINDX on Table1(MostSelective,SecondMost,Least);
CREATE NONCLUSTERED INDEX MyINDX2 on Table2(Least,SecondMost,MostSelective);

INSERT INTO Table1 (MostSelective, SecondMost, Least)
output inserted.* into Table2
SELECT TOP 26 REPLICATE(CHAR(number + 65),800), number/5, '~'
FROM master..spt_values
WHERE type = 'P' AND number >= 0
ORDER BY number;

Now doing a query against both of the tables...

SELECT *
FROM   Table1
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~';

SELECT *
FROM   Table2
WHERE  MostSelective = REPLICATE('P', 800)
       AND SecondMost = 3
       AND Least = '~'; 

... Both of them use an index fine and both are given the exact same cost.

enter image description here

The ASCII art in the accepted answer is not in fact how indexes are structured. The index pages for Table1 are represented below (click the image to open in full size).

enter image description here

The index pages contain rows containing the whole key (in this case there is actually an additional key column appended for the row identifier as the index was not declared as unique but that can be disregarded further information about this can be found here).

For the query above SQL Server doesn't care about the selectivity of the columns. It does a binary search of the root page and discovers that the Key (PPP...,3,~ ) is >=(JJJ...,1,~ ) and < (SSS...,3,~ ) so it should read page 1:118. It then does a binary search of the key entries on that page and locates the leaf page to travel down to.

Altering the index in order of selectivity doesn't affect either the expected number of key comparisons from the binary search or the number of pages that need to be navigated to do an index seek. At best it might marginally speed up the key comparison itself.

Sometimes ordering the most selective index first will make sense for other queries in your workload though.

E.g if the workload contains queries of both the following forms.

SELECT * ... WHERE  MostSelective = 'P'

SELECT * ...WHERE Least = '~'

The indexes above aren't covering for either of them. MostSelective is selective enough to make a plan with a seek and lookups worthwhile but the query against Least isn't.

However this scenario (non covering index seek on subset of leading column(s) of a composite index) is only one possible class of query that can be helped by an index. If you never actually search by MostSelective on its own or a combination of MostSelective, SecondMost and always search by a combination of all three columns then this theoretical advantage is useless to you.

Conversely queries such as

SELECT MostSelective,
       SecondMost,
       Least
FROM   Table2
WHERE  Least = '~'
ORDER  BY SecondMost,
          MostSelective 

Would be helped by having the reverse order of the commonly prescribed one - as it covers the query, can support a seek and returns rows in the desired order to boot.

So this is an often repeated piece of advice but at most it's a heuristic about the potential benefit to other queries - and it is no substitute for actually looking at your workload.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • This answer seems like it is the most well-explained of them all, but I am suspicious because the low number of upvotes. I am curious, do you still concur with this explanation? That is, has your understanding of this process changed at all? And if not, do you have any idea as to why this answer is so corroborated here? – Ross Brasseaux Aug 25 '20 at 21:35
  • @Lopsided - No the answer is correct. It was posted 6 years after others on the page so goes some way towards explaining the vote discrepancy – Martin Smith Aug 26 '20 at 05:44
  • Thanks. And just to correct a typo, I meant to say “...the answer is so *uncorroborated* here?” – Ross Brasseaux Aug 26 '20 at 16:18
34

you should put columns that will be the most selective at the beginning of the index declaration.

Correct. Indexes can be composites - composed of multiple columns - and the order is important because of the leftmost principle. Reason is, that the database checks the list from left to right, and has to find a corresponding column reference matching the order defined. For example, having an index on an address table with columns:

  • Address
  • City
  • State

Any query using the address column can utilize the index, but if the query only has either city and/or state references - the index can not be used. This is because the leftmost column isn't referenced. Query performance should tell you which is optimal - individual indexes, or multiple composites with different orders. Good read: The Tipping Point, by Kimberley Tripp

OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
  • What if it was only the rightmost column that wasn't being used? So a query used Address and city, but NOT state. Would the index be used then? – Abe Miessler Feb 18 '10 at 22:56
  • @Abe: Rightmost would not be used - you have to satisfy the index order starting from the left. Miss one, can't use it. – OMG Ponies Feb 18 '10 at 23:05
  • 5
    @Abe: If you queried on Address and city, but NOT state - then yes, the index would be used. In other words, the database is able to use partial indexes to satisfy a request, as long as it's able to start from the left of an index and move to the right in using the fields that are being queried. If, however, you queried using Address and State, but NOT city, it may still use the index, but it will not be as efficient - because now it is only able to use the Address portion of the index (b/c next is city and it's not being used in the query). – JaredC Jun 04 '13 at 17:02
15

Selectivity is a very minor factor; "Leftmost" is critial

Selectivity of the individual columns in a composite index does not matter when picking the order.

Here is the simple thought process: Effectively, an index is the concatenation of the columns involved.

Giving that rationale, the only difference is comparing two 'strings' that differ earlier versus later in the string. This is a tiny part of the total cost. There is no "first pass / second pass", as mentioned in one Answer.

So, what order should be used?

  1. Start with column(s) tested with =, in any order.
  2. Then tack on one range column.

For example, the very-low selectivity column must come first in this:

WHERE deleted = 0  AND  the_datetime > NOW() - INTERVAL 7 DAY
INDEX(deleted, the_datetime)

Swapping the order in the index would have it totally ignore deleted.

(There are a lot more rules for ordering the columns.)

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Is the negative vote because I am wrong? Or because I have a strong opinion? Or something else? – Rick James Feb 06 '19 at 19:55
  • wasn't my downvote, but deleted = 0 to me sounds like it is not low selectivity? I imagine it would be majority of the rows in the table. – Greg Feb 07 '19 at 20:57
  • @Greg - I think that means "low selectivity" -- That is, using `deleted` does not help much in filtering out unwanted rows. Do you have a better example? (That's the one that popped into my mind when I wrote the Answer.) – Rick James Feb 07 '19 at 22:35
  • Misunderstanding on my part. – Greg Feb 07 '19 at 22:42
  • I like your answer, but I think you were downvoted (not for me) because of the sentence "All the answers are wrong". If you think you are the only one correct you need write "All the **other** answers are wrong" :) . Jokes apart, I'm really interested in "There are a lot more rules for ordering the columns", can you give references? – Click Ok Jan 07 '20 at 01:47
  • 1
    @ClickOk - Thanks. My cookbook gives some basic info: http://mysql.rjweb.org/doc.php/index_cookbook_mysql – Rick James Jan 07 '20 at 05:14
  • @RickJames why would it ignore deleted? When traversing through the B+ tree, I can imagine deleted being useful after comparing against date. – wingerse Aug 07 '22 at 09:53
  • 1
    @wingerse - The columns after a range-tested column (`the_datetime`) will be ignored (except for "covering"). Think of it like this: Consider a list of names sorted by last_name, then first_name, with `INDEX(last_name, first_name)`. And `WHERE last_name LIKE 'J%' AND first_name = 'Rick'. It will have to scan all the J's; having `first_name` in the index has very little impact on performance. – Rick James Aug 08 '22 at 04:26
  • @RickJames but why tho? If I were to write my own algorithm, I can make use of deleted after range testing datetime. Is there any reason mysql isn't? – wingerse Aug 08 '22 at 04:29
  • @wingerse - (I updated my Comment while you were typing.) Even if `deleted` is used, it has little performance benefit. `INDEX(first_name, last_name` could have much fewer rows to check -- all the Ricks beginning with J -- no need to skip over any rows in the index. – Rick James Aug 08 '22 at 04:32
  • @RickJames if its using a B+ tree, the search can skip branches where last_name like J% but first_name < 'Rick' or first_name > 'Rick'. So it doesn't need to scan all J%, does it? – wingerse Aug 08 '22 at 04:39
  • Two ranges are also problemenatic. (Which makes "find the nearest" on the globe quite difficult.) – Rick James Aug 08 '22 at 04:44