6

I've been studying indexes and trying to understand how they work and how I can use them to boost performance, but I'm missing something.

I have the following table:

Person:

| Id | Name | Email | Phone |
| 1  | John |  E1   |  P1   |
| 2  | Max  |  E2   |  P2   |

I'm trying to find the best way to index the columns Email and Phone considering that the queries will (most of the time) be of the form

[1] SELECT * FROM Person WHERE Email = '...' OR Phone = '...'
[2] SELECT * FROM Person WHERE Email = ...
[3] SELECT * FROM Person WHERE Phone = ...

I thought the best approach would be to create a single index using both columns:

CREATE NONCLUSTERED INDEX [IX_EmailPhone]
ON [dbo].[Person]([Email], [PhoneNumber]);

However, with the index above, only the query [2] benefits from an index seek, the others use index scan.

I also tried to create multiple index: one with both columns, one for email, and one for email. In this case, [2] and [3] use seek, but [1] continues to use scan.

Why can't the database use index with an or? What would be the best indexing approach for this table considering the queries?

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
victor
  • 1,532
  • 1
  • 13
  • 32
  • It also depends on how many rows are returned for each query. If (1) is meant to return 50% of the table, then a scan sounds just fine, because you are using `SELECT *` too, so it needs to get back to the table to get the other columns anyway – Lamak Nov 29 '16 at 19:52
  • My mistake, I'm not actually using *, I just used that to make the code simpler. I'am however selecting all the columns. Will that make such a difference? – victor Nov 29 '16 at 19:54
  • using `SELECT *` is just a shortcut to use `SELECT `, so the argument is the same – Lamak Nov 29 '16 at 19:56
  • Regarding your first comment, it means then that the undex may or may not be used based on statistics? – victor Nov 29 '16 at 20:00
  • Of course it depends on statistics. If the query is getting a lot of rows as a result, using an index and then going back to the table for every row to get the rest of the columns is far slower than just scanning the table in the first place – Lamak Nov 29 '16 at 20:02
  • Right. They are returning only 2 rows though. I tried using the index fro the answer below, and the query filtered only by the phone uses the (phone, email) index. But the query filtered by only the email does not use the (email) index. Both of them return the same number of rows. All of these queries are meant to return a very small number of rows – victor Nov 29 '16 at 20:05
  • but how many rows are in the table? – Lamak Nov 29 '16 at 20:05
  • Currently contanins 822 – victor Nov 29 '16 at 20:07
  • 2
    using an index with that number of rows is pointless. It will almost always better to just do a table scan (unless you were using `SELECT [a column that's in the index]`. There are just too few rows in the table – Lamak Nov 29 '16 at 20:09
  • This table will grow quite fast. What worries me is that the users constantly perform an action that requiries these queries to run, so they must really be as fast as they can, that's why I decided to add the indexes now. Also, there are many places where a query for the email is necessary, scaning the table all the time seems pretty wasteful, specially when comparing a big string. – victor Nov 29 '16 at 20:13
  • the optimizer will start using the index once it's necessary, right now there's no point. – Lamak Nov 29 '16 at 20:14
  • But then why is the index already being used when I look only for a phone, and not used when I look only for an email, considering the indexes from the answer? That's what Im not understanding – victor Nov 29 '16 at 20:16
  • Does this answer your question? [Best index(es) to use for an OR Statement in SQL Server](https://stackoverflow.com/questions/3054642/best-indexes-to-use-for-an-or-statement-in-sql-server) – lcnicolau Sep 14 '21 at 21:26

2 Answers2

1

Use two separate indexes, one on (email) and one on (phone, email).

The OR is rather difficult. If your conditions were connected by AND rather than OR, then your index would be used for the first query (but not the third, because phone is not the first key in the index).

You can write the query as:

SELECT *
FROM Person 
WHERE Email = '...' 
UNION ALL
SELECT *
FROM Person 
WHERE Email <> '...' AND Phone = '...';

SQL Server should use the appropriate index for each subquery.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Just tried it. Still getting a scan for [...] WHERE Email = ... Also, can you explain the rationale behind your index choice? Also your query resulted in 2 scans – victor Nov 29 '16 at 19:57
  • 1
    @victor . . . How large are the tables? SQL Server is not going to use indexes unless the data spans multiple data pages. – Gordon Linoff Nov 30 '16 at 02:49
1

Create a separate index for each column.
By using hints we can force the optimizer to use/not use the indexes, so you can check the execution plan, get a feeling of the performance involved and understand the meaning of each path.

Go through my demo and consider the work involved in each path for the following scenarios -

  1. Only few rows satisfy the condition j=123.
    Only few rows satisfy the condition k=456.

  2. Most of the rows satisfy the condition j=123.
    Most of the rows satisfy the condition k=456.

  3. Only few rows satisfy the condition j=123.
    Most of the rows satisfy the condition k=456.

Try to think what path you would have chosen for each scenario.
Please feel free to ask questions.


Demo

;with t(n) as (select 0 union all select n+1 from t where n < 999)

select      1+t0.n+1000*t1.n                                as i
           ,floor(rand(cast (newid() as varbinary))*1000)   as j
           ,floor(rand(cast (newid() as varbinary))*1000)   as k 

into        t

from        t t0,t t1 

option       (maxrecursion 0)
;

create index t_j on t (j);
create index t_k on t (k);

update statistics t (t_j)
update statistics t (t_k)

Scan

select      *
from        t (forcescan)
where       j = 123
        or  k = 456
  • This is straightforward.

enter image description here

Seek

select      *
from        t (forceseek)
where       j = 123
        or  k = 456
  • "Index Seek": Each index is being seeked for the relevant values (123 and 456)
  • "Merge Join": The results (row IDs) are being concatenated (as in UNION ALL)
  • "Stream Aggregate": Duplicate row IDs are being eliminated
  • "Rid Lookup" + "Nested Loops": The row IDs are being used to retrieve the rows from the table (t)

enter image description here

David דודו Markovitz
  • 42,900
  • 6
  • 64
  • 88