What sort of Index for 'AND' columns?

Question

I have a table to store people and want to select where the person is not marked as "deleted". I have a clustered primary key on the ID column (PersonID).

The 'Deleted' column is a DATETIME, nullable, and is populated when deleted.

My query looks like this:

SELECT *
FROM dbo.Person
WHERE PersonID = 100
AND Deleted IS NULL

This table can grow to around 40,000 people.
Should I have an index that covers the Deleted flag as well?

I may also query things like:

SELECT *
FROM Task t
INNER JOIN Person p
    ON p.PersonID = t.PersonID
    AND p.Deleted IS NULL
WHERE t.TaskTypeId = 5
AND t.Deleted IS NULL

Task table estimate is about 1.5 million rows.

I think I need one that covers both the pk and the deleted flag on both tables? i.e. on (Task.TaskId, Task.Deleted) and (Person.PersonID, Person.Deleted)?

Reasons for me investigating an index rethink, is due to a number of deadlock occurring in complex procedures. I'd like to reduce the number of rows locked on selects/writes/updates, as well as get a performance gain.

Good way - look at query execution plan. Sometimes, SQL Server gives very good recomendations about creating indexes based on his statistics. So, you can just find missing indexes. http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan http://sqlperformance.com/2013/06/t-sql-queries/missing-index — Backs, Aug 14 '15 at 05:07
An important metric is the stats and distribution of `deleted` columns. How many `deleted IS NULL` records do you have in each of your tables. If you always search for `deleted is null`, can can create a filtered index with this filter. Other than that, Do you query your deleted records? Do you need your deleted records to remain in the same table table. can it be moved to an archive table. these are some of the things you can consider. — ughai, Aug 14 '15 at 05:09
Thanks @ughai - All queries have a 'Deleted IS NULL' check, including JOINS to that table. At the moment, 97% of the records are NULL (i.e. Not deleted). We never query 'WHERE Deleted IS NOT NULL'. I'm going to read what a 'filtered index' is, as that seems to be something useful. — Craig, Aug 14 '15 at 05:13
Why is `PersonID` not clustered? You wouldn't need the indices if it were. In any case, 40 thousand rows is still in the realm when the engine might still decide not to use the index anyway - depending on the size of your row, of course. The second query is the more important one - you should focus on the filter that eliminates the most rows; it's not obvious what that is here. — Luaan, Aug 14 '15 at 05:25
My error, sorry. The PK on Person (PersonID) is indeed a clustered index. I'll edit that error. — Craig, Aug 14 '15 at 05:34

score 2 · Answer 1 · edited Aug 14 '15 at 05:12

2

Since you are using SQL Server 2008, the fastest querying might well be using a filtered index. In this Deleted column whose type is DATETIME and nullable, you could try something like this index:

CREATE NONCLUSTERED INDEX Filtered_Deleted_Index
ON dbo.Person(Deleted)
WHERE Deleted IS NOT NULL

This will get you the smallest valid set in both use cases you listed above (for querying dbo.Person and also joining with Tasks).

edited Aug 14 '15 at 05:12

marc_s

732,580
175
1,330
1,459

answered Aug 14 '15 at 05:12

Emacs User

1,457
1
12
19

Since the I doubt your index which help much for this query(apart from the filter that is) since your primary index column is `Deleted` column and not on `PersonID` which is part of the join. – ughai Aug 14 '15 at 05:31
Completely agree with ughai, another point is selectivity of deleted column is very low and does not cover all column which OP required – Neeraj Prasad Sharma Aug 14 '15 at 05:40
Actually, this filtered index cannot help at all to find records where `Deleted IS NULL` when the index only track records where `Deleted IS NOT NULL`. – Andreas Aug 14 '15 at 06:12
@Andreas, the filter can be adjusted to have IS NOT NULL or IS NULL based on data profile, which we don't know precisely. But the idea is to have a filtered index. – Emacs User Aug 14 '15 at 12:45
@ughai, this index does not replace other indexes, the pk, or the clustered index that OP may have. This index is an additional one that will serve just those two use cases the OP wanted. I can see your doubts when you confuse a filtered index with primary index. – Emacs User Aug 14 '15 at 12:49
By primary i mean the first index key in your index which is deleted. Once a filter is already applied there is no need to have a delete index key. You need to have personid in the index key to help facilitate the join. – ughai Aug 14 '15 at 12:57
@ughai my filtered index is not for the join but for filtering the fastest for both use cases, one of which does not use join. – Emacs User Aug 14 '15 at 13:00

Bohemian · Answer 2 · 2015-08-14T05:03:59.367

Your instinct is (generally speaking) sound - an index that contains all columns needed for the query is called a covering index, which in this case would require:

CREATE INDEX Person_PersonID_Deleted ON Person(PersonID, Deleted);

You are unlikely to get much performance benefit on index lookup by adding the Deleted column, since searching for null is (usually) ignored, but having these indexes means that accessing the table can be bypassed entirely for Person.

You could also try creating:

CREATE INDEX Task_TaskTypeId_Deleted ON Task(TaskTypeId, Deleted);

which will avoid accessing Task rows that are marked as "deleted", and Task would then only accessed for non-deleted rows. However, if most of your Tasks are not deleted, I wouldn't bother with this index.

It's worth trying out various combinations of index(es) to see which combination gives the best result.

Thanks @Bohemian - Would I also have one on PersonID only? Or would the covering index be suitable (With PersonID, Deleted)? — Craig, Aug 14 '15 at 05:02
You would definitely keep the primary key, which is implicitly backed by an index, on PersonID only, to enforce the uniqueness of the value. — Andreas, Aug 14 '15 at 05:45

score 1 · Answer 3 · answered Aug 14 '15 at 05:44

1

Since the primary key is PersonID, adding another index with extra columns after PersonID will not improve the "selectability" of the index, although is may prevent the need to lookup the record by rowid for filtering on deleted. With only 3% records filtered, that's nothing, so don't create another index on Person.

As for the Task table, it very much depends on the selectability of TaskTypeId = 5 AND Deleted IS NULL, i.e. how many records match the criteria. In general, a sequential search (full table scan) is faster than an index scan with row lookup if more than 20% of the records are selected. For very larger tables where the data is very distributed (e.g. physically every 10th record is selected), the performance threshold is below 10%.

So, if more than 10-20% of Task records are type 5, and only 3% of records are deleted, no index will improve performance, because the fastest access plan is likely a merge join of two full table scans.

answered Aug 14 '15 at 05:44

Andreas

154,647
11
152
247

It's interesting that you mentioned 'Lockup'. The issue that's caused our re-think of indexing is due to a number of deadlocks occurring. So we're attempting to reduce the number of rows locked, as well as speed. – Craig Aug 14 '15 at 05:48
If you're having deadlock issues, you might want to consider turning snapshot mode on, since that eliminates the need for queries to obtain record locks. – Andreas Aug 14 '15 at 06:14
In general, a sequential search (full table scan) is faster than an index scan with row lookup if more than 20% of the records are selected. "DO you mean by index seek with row/key lookup" – Neeraj Prasad Sharma Aug 14 '15 at 06:17
@Neeraj Yes, index scan/seek on the indexed columns with a nested loop to lookup the row by rowid (unclustered) or by key (clustered). – Andreas Aug 14 '15 at 06:21

What sort of Index for 'AND' columns?

3 Answers3