Optimize TSQL Join

Question

Have a query that takes over a minute to run
Table has over 2 million rows
[sID] is the PK
[textHash] is indexed and allows nulls
Both indexes have less than 1% fragmentation
What I want to to is add [textHash] match to the the basic where
Since [textHash] can be null need the two join conditions

Because of the way other conditions are built up Union on the main select is not an option.
In the next major release changing up the way queries are built to be able to use UNION.

In general I want to join on a property that can be null
If it is null then include the row based on PK

Select top 10001 [docFam].[sID] 
From [docSVsys] with (nolock) 
LEFT OUTER JOIN [docSVsys] as [docFam] with (nolock) 
   On [docSVsys].[sID] = [docFam].[sID] 
   Or [docSVsys].[textHash] = [docFam].[textHash]
Where [docSVsys].[sID]  <=  '1000'
 Group By  [docFam].[sID] 
 Order By  [docFam].[sID] Asc

If some would please tell my how to copy the query plan I will include it

I tried HASH, MERGE, and LOOP hints.
The first two the compiler rejected and the LOOP was slower then no hint.

I tried

On ([docSVsys].[textHash] is null and [docSVsys].[sID] = [docFam].[sID])
Or  [docSVsys].[textHash] = [docFam].[textHash]

And it was slower

A similar query runs in 2 seconds
But in this case [sParID] is not null so I only need the one join condition

Select top 10001 [docFam].[sID] 
From [docSVsys] with (nolock) 
LEFT OUTER JOIN [docSVsys] as [docFam] with (nolock) 
   On [docSVsys].[sParID] = [docFam].[sParID] 
Where [docSVsys].[sID]  <=  '1000'
 Group By  [docFam].[sID] 
 Order By  [docFam].[sID] Asc

For queries that return a small number of rows APPLY works.
This syntax below runs in 1 second versus 1 minute for the syntax above (returns 1,022 rows).

Still have a problem with both forms of the query for conditions that return a lot of rows but I don't view that as a SQL or syntax problem - a lot of rows is going to take longer.

Select [docFam].[sID] 
From [docSVsys] with (nolock)
OUTER APPLY -- cross apply 
  (   
    Select [docSVsysHashNull].[sID] 
    From   [docSVsys] as [docSVsysHashNull]with (nolock) 
    where  [docSVsysHashNull].[sID] = [docSVsys].[sID]
   union
    Select [docSVsysHashNotNull].[sID] 
    From   [docSVsys] as [docSVsysHashNotNull]with (nolock) 
    where  [docSVsysHashNotNull].[sID] != [docSVsys].[sID]
      and  [docSVsysHashNotNull].[textHash] = [docSVsys].[textHash]
  ) as docFam
Where [docSVsys].[sID]  <=  '1000'
Group By [docFam].[sID]
Order By [docFam].[sID] Asc

@chrisb DISTINCT takes a little longer and the meat of the query plan is identical. — paparazzo, Mar 01 '13 at 22:44
How long does it take if you perform only with [docSVsys].[textHash] = [docFam].[textHash] join condition? I guess it takes also about 1m. Maybe clustering index on textHash field rather PK could improve performance. GL! — www, Mar 02 '13 at 01:29
@WawrzyniecSz. Please go away. Table already has a PK - a table can have only one clustered index.. textHash allows duplicates and is not unique - it cannot be a clustered index. — paparazzo, Mar 02 '13 at 15:52
I am away. http://stackoverflow.com/questions/4332982/do-clustered-indexes-have-to-be-unique . BTW GL. — www, Mar 02 '13 at 16:02
Correct me if I'm wrong, but if you want to return list of sids matched by texthash and stuff sid if no match is found, you might just say `select distinct isnull(docFam.sID, docSVsys.sID) as sID from docSVsys left join docSVsys as docFam on docSVsys.textHash = docFam.textHash and docSVsys.sID <> docFam.sID ...` thus eliminating `or`. — Nikola Markovinović, Mar 08 '13 at 01:20
@NikolaMarkovinović You were close. Had to remove the docSVsys.sID <> docFam.sID as that was eliminating the match on itself when there were other matches. If you will post that correction as an answer I will accept. — paparazzo, Mar 08 '13 at 15:05
Oh, I thought that you didn't want to include self-matches. I'm glad if this had helped you. — Nikola Markovinović, Mar 08 '13 at 15:08
@NikolaMarkovinović And your TSQL is right at twice as fast as the UNION — paparazzo, Mar 08 '13 at 15:17

score 1 · Answer 1 · answered Mar 02 '13 at 01:53

1

Did you considered writing it with exists clause. something like this

select distinct top 10001 a.[sID]
From [docSVsys] as a
where exists 
(
select *
From [docSVsys] as b
where b.[sID]<='1000'
and ( a.[sID] = b.[sID] OR a.[textHash] = b.[textHash])
)

answered Mar 02 '13 at 01:53

ClearLogic

3,616
1
23
31

No faster on queries that return few and slower on condition that return a lot. Still +1 it was worth trying. – paparazzo Mar 02 '13 at 17:10

paparazzo · Accepted Answer · 2013-03-10T15:00:11.377

0

The answer was provided in a comment from @NikolaMarkovinović

select distinct isnull(docFam.sID, docSVsys.sID) as sID 
from  docSVsys 
left  join docSVsys as docFam 
  on  docSVsys.textHash = docFam.textHash 
where ...

edited Mar 10 '13 at 15:00

answered Mar 04 '13 at 17:22

paparazzo

44,497
23
105
176

Optimize TSQL Join

2 Answers2