1

Any ideas on how to improve this query performance?

[ftsIndex] The PK is sID, wordPos.
And there is an index on wordID, sID, wordPos.
They are all int.

In the end use a distinct.
Most sID just have a few matches.
Some sID may have over 10,000 matches and kill the query.

Have a query where the first 27,749 rows return in 11 seconds.
No single sID has more than 500 matches.
The sum of the individual matches is 65,615.

The 27,750th row alone takes over 2 minutes and has 15,000 matches.

Not a surprise as the join at the end is on [sID].

Since in the end use distinct is there a way to it to look for the first affirmative

on [wXright].[sID] = [wXleft].[sID]
    and [wXright].[wordPos] >  [wXleft].[wordPos]
    and [wXright].[wordPos] <= [wXleft].[wordPos] + 10

then move to the next sID?

I know this is asking a lot from the query optimizer but that would really be cool.

In real life the problem document is a part lists and the supplier is repeated many times.

select distinct [wXleft].[sID] 
 FROM 
 ( -- begin [wXleft]
   ( -- start term
      select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Brown') 
   ) -- end term
 ) [wXleft]
 join 
 ( -- begin [wRight]
   ( -- start term
      select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Fox')
   ) -- end term
 ) [wXright]
 on [wXright].[sID] = [wXleft].[sID]
and [wXright].[wordPos] >  [wXleft].[wordPos]
and [wXright].[wordPos] <= [wXleft].[wordPos] + 10

This brings it down to 1:40

inner loop join

I did it just to try and it totally changed up the query plan.
I don't know how long the problem query takes. I gave up at 20:00.
I am not even going to to post this as an answer as I don't see that it would be of value to anyone else.
Hoping for a better answer.
If I don't get one in the next two days I will just delete the question.

This does not fix it

  select distinct [ft1].[sID]
  from [ftsIndex] as [ft1] with (nolock)
  join [ftsIndex] as [ft2] with (nolock)
    on [ft2].[sID] = [ft1].[sID]
   and [ft1].[wordID] in (select [id] from [FTSwordDef] with (nolock) where [word] like 'brown')
   and [ft2].[wordID] in (select [id] from [FTSwordDef] with (nolock) where [word] like 'fox')
   and [ft2].[wordPos] >  [ft1].[wordPos]
   and [ft2].[wordPos] <= [ft1].[wordPos] + 10

Also support queries like "quick brown" with 10 words of "fox" or "coyote" so joins with aliases is not a good path.

This takes 14 minutes (but at least it runs).
Again this format is not conducive to more advanced queries.

 IF OBJECT_ID(N'tempdb..#tempMatch1', N'U') IS NOT NULL   DROP TABLE #tempMatch1 
 CREATE TABLE #tempMatch1(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK1] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))
 IF OBJECT_ID(N'tempdb..#tempMatch2', N'U') IS NOT NULL   DROP TABLE #tempMatch2 
 CREATE TABLE #tempMatch2(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK2] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))
insert into #tempMatch1 
select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Brown')
        --and [wordPos] < 100000; 
   order by [ftsIndex].[sID], [ftsIndex].[wordPos]                      
insert into #tempMatch2 
select [ftsIndex].[sID], [ftsIndex].[wordPos]
      from [ftsIndex] with (nolock)
      where [ftsIndex].[wordID] in 
              (select [id] from [FTSwordDef] with (nolock) 
                             where [word] like 'Fox')
        --and [wordPos] < 100000;
   order by [ftsIndex].[sID], [ftsIndex].[wordPos]
select count(distinct(#tempMatch1.[sID]))
from #tempMatch1 
join #tempMatch2
  on #tempMatch2.[sID] = #tempMatch1.[sID]
 and #tempMatch2.[wordPos] >  #tempMatch1.[wordPos]
 and #tempMatch2.[wordPos] <= #tempMatch1.[wordPos] + 10

A slightly different join runs in 5 seconds (and has a different query plan).
But I cannot fix it with hints as it moves where it does one join.
And even the +1 has over 10 documents that have over 7,000 matches.

on [wXright].[sID] = [wXleft].[sID]
and [wXright].[wordPos] =  [wXleft].[wordPos] + 1

Full table def

CREATE TABLE [dbo].[FTSindex](
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
    [wordID] [int] NOT NULL,
    [charPos] [int] NOT NULL,
 CONSTRAINT [PK_FTSindex] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
)WITH (PAD_INDEX  = ON, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 100) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[FTSindex]  WITH CHECK ADD  CONSTRAINT [FK_FTSindex_FTSwordDef] FOREIGN KEY([wordID])
REFERENCES [dbo].[FTSwordDef] ([ID])
GO

ALTER TABLE [dbo].[FTSindex] CHECK CONSTRAINT [FK_FTSindex_FTSwordDef]
GO
paparazzo
  • 44,497
  • 23
  • 105
  • 176
  • I don't know with all of your data but have you thought of maybe inserting into the temp tables and then creating the clustered indexes on them then? Insert first, then create index. This is usually faster than creating an index by itself. This may help you, it may not so I thought to add it as a comment. – djangojazz May 29 '13 at 21:45
  • @djangojazz That insert only takes 5 seconds. If I add a sort so records are inserted in PK order it is still 5 seconds. – paparazzo May 29 '13 at 21:57
  • We will need the table/key/index definitions and the query plan (actual). Also, is there any reason for this design/approach, as opposed to just using SQL Server Full-Text Search? – RBarryYoung May 29 '13 at 22:14
  • SQL Server Full-Text Search does not do "quick brown" within 10 of "fox or "coyote". Will add table def. How do you post a query plan? – paparazzo May 29 '13 at 22:21
  • You have to drop it in an online share somewhere and link to it. (My least favorite thing about StackOverflow) – RBarryYoung May 29 '13 at 22:24
  • @RBarryYoung I am remoted into the SQL box and cannot even copy the the file right now. I should be able to get it tomorrow. – paparazzo May 29 '13 at 22:42

2 Answers2

1

UPDATE:

You can still use union all which helps optimizer retain ordering from index if you delay filtering 'L' and 'R' sides until the last part of the process. Unfortunately, you need to retrieve all wordids beforehand and use them in equals condition. On my machine it reduces execution time to 2/3:

  ; with o as (
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 1
   union all
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 4
   union all
    select sID, wordPos, wordID
      from FTSindex 
     where wordID = 2
 ),
 g as (
    select sID, wordPos, wordID,
           ROW_NUMBER() over (partition by [sID] order by wordPos) rn
      from o
 )
 select count(distinct(g1.sID))   --   26919 00:02 
      from g g1
      join g g2
        on g1.sID = g2.sID 
       and g1.rn  = g2.rn - 1
       and g1.wordPos >= g2.wordPos - 10 
    -- Now is the time to repartition the stream
       and g1.wordID in (1, 4)
       and g2.wordID = 2

Oh, is it really taking two seconds now?

UPDATE - 2:

; with o as (
 -- Union all resolves costly sort
    select sid, wordpos, wordid
      from FTSindex 
     where wordID = 1
     union all
    select sid, wordpos, wordID
      from FTSindex 
     where wordID = 2
),
g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
      from o
)
select count(distinct g1.sid)
  from g g1
 inner join g g2
    on g1.sID = g2.sID 
   and g1.rn = g2.rn - 1
 where g1.wordID = 1
   and g2.wordID = 2
   and g1.wordPos >= g2.wordpos - 10

1 and 2 stand for selected words' ids. The results are different than ones produced by original query in regard of multiple hits within 10 words; original query will report all of them but this one will show the closest one only.

The idea is to extract only the words searched for and compare the distance between two neighbouring words, where wordID 1 comes first and wordID 2 second.

UPDATE - 1:

I took down this post because it did not perform as well as I have thought. But, it suites OP's needs better than optimized query because it allows for multiple words being searched at the same time (a list of words found in close proximity of another word(s) which might be specified in where clause).

; with g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
      from FTSindex
     where wordID in (1, 2)
)
select count(distinct g1.sid)
  from g g1
 inner join g g2
    on g1.sID = g2.sID 
   and g1.rn = g2.rn - 1
 where g1.wordID = 1
   and g2.wordID = 2
   and g1.wordPos >= g2.wordpos - 10

FIRST ATTEMPT:

There might be a way using cross apply in combination with top 1.

select [wXleft].[sID], [wXleft].[wordPos]
  from [ftsIndex] wXleft with (nolock)
 cross apply 
 (
    select top 1 r.sID 
      from [ftsIndex] r 
     where r.sID = wXleft.sID 
       and r.wordPos > wxLeft.wordPos 
       and r.wordPos <= wxLeft.wordPos + 10 
       and r.wordID in
           (select [id]
              from [FTSwordDef] with (nolock) 
             where [word] like 'Fox') 
 ) wXright
 where [wXleft].[wordID] in 
       (select [id] 
          from [FTSwordDef] with (nolock) 
         where [word] like 'Brown') 

BONUS PIVOT ATTEMPT:

; with o as (
    select sid, wordpos, wordid
      from FTSindex 
     where wordID = 1
     union all
    select sid, wordpos, wordID
      from FTSindex 
     where wordID = 2
),
g as (
    select sid, wordid, wordpos,
           ROW_NUMBER() over(order by sid, wordpos) rn
    from o
)
select sid, rn, [1], [2]
from
(
-- Collapse rns belonging to wordid 2 to ones belonging to wordid 1
-- so they appear in the same row
   select sid, wordpos, wordid, rn - case when wordid = 1 then 0 else 1 end rn
   from g
) g1
pivot (max(wordpos) for wordid in ([1], [2])) u
where [2] - [1] <= 10
Community
  • 1
  • 1
Nikola Markovinović
  • 18,963
  • 5
  • 46
  • 51
  • Returns the same answer as inner loop join in 2/3 the time. Will wait a couple days for miracle answer before accepting this. Thanks. – paparazzo May 30 '13 at 01:04
  • Why did you take that other option down? It is faster. I have been trying to tweak it to try and get a little more out it. What is strange is the sort coming out of that CTE is what dominates the cost. – paparazzo Jun 01 '13 at 23:54
  • @Blam because my timing was wrong, it took as much as my original attempt. Meanwhile I have resolved the sorting part but I am troubled by Sql Server's need to execute CTE once per reference, and there are two references. I'll post new version in a minute. – Nikola Markovinović Jun 02 '13 at 00:04
  • On my test it was a little faster and I have been trying to figure out how to reduce the cost of that sort. Those two sorts are 70% of the cost even when that index is by that sort so in my mind it should be a cheap sort. – paparazzo Jun 02 '13 at 00:09
  • @Blam please take a look at new version. – Nikola Markovinović Jun 02 '13 at 00:12
  • This is complex. The second update is fast (like 1/4 the time of the first second) but for only for wordID = x. WordID in (x, y) then the second second is very flow. But the first update tolerates in. I am marking this as the answer and very much appreciate your help and time. But please also include your first update with a note that it tolerates in. I still feel like there is a way to beat this. But I feel like I should mark this as an answer as you answered the stated question. If you have more ideas I will test them. In real life I need in (w,y,z) or in (select ...). – paparazzo Jun 02 '13 at 01:27
  • @Blam Thank you. I've included original self-join attempt in my answer. I've also added pivoting attempt which was not as fast as others but might open up space for new ideas. Unfortunately it calls for dynamic sql to build the list of wordids. I'll see if I could come up with some ideas later; it is 4AM here. Good luck. – Nikola Markovinović Jun 02 '13 at 01:54
  • See hard code of L and R to avoid the in. I did not test the pivot. – paparazzo Jun 02 '13 at 14:33
  • Hi @Blam. I've updated my answer with `union all` variant using your current query. – Nikola Markovinović Jun 03 '13 at 00:46
1

Well, I wish I had more information or a way to test, but failing that, this is what I would probably try:

 IF OBJECT_ID(N'tempdb..#tempMatch', N'U') IS NOT NULL   DROP TABLE #tempMatch
 CREATE TABLE #tempMatch(
    [sID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
    [wordID] [int] NOT NULL,
 CONSTRAINT [PK2] PRIMARY KEY CLUSTERED 
(
    [sID] ASC,
    [wordPos] ASC
))

--
;WITH cteWords As 
(
            SELECT 'Brown' as [word]
  UNION ALL SELECT 'Fox'
)
INSERT INTO #tempMatch ([sID],[wordPos],[wordID])
SELECT sID, wordPos, wordID
FROM    ftsIndex
WHERE   EXISTS
        (Select * From FTSWordDef s1
         inner join cteWords s2 ON s1.word = s2.word
         Where ftsIndex.wordID = s1.id)
;

select count(distinct(s1.[sID]))
    from #tempMatch s1
    join #tempMatch s2
        on  s2.[sID] = s1.[sID]
        and s2.[wordPos] >  s1.[wordPos]
        and s2.[wordPos] <= s1.[wordPos] + 10
    where s1.wordID = (select id from FTSWordDef w where w.word = 'Brown')
      and s2.wordID = (select id from FTSWordDef w where w.word = 'Fox')

I came up with one alternate version last night. It's the same queries as above, but the CREATE statement is changed to:

 IF OBJECT_ID(N'tempdb..#tempMatch', N'U') IS NOT NULL   DROP TABLE #tempMatch
 CREATE TABLE #tempMatch(
    [sID] [int] NOT NULL,
    [wordID] [int] NOT NULL,
    [wordPos] [int] NOT NULL,
 CONSTRAINT [PK0] PRIMARY KEY CLUSTERED 
(
    [wordID] ASC,
    [sID] ASC,
    [wordPos] ASC
))

Please let me know if these help at all.

paparazzo
  • 44,497
  • 23
  • 105
  • 176
RBarryYoung
  • 55,398
  • 14
  • 96
  • 137
  • Had to add wordID to the constraint on the first and both throw an error on the join cteWords. – paparazzo May 31 '13 at 14:14
  • @Blam What's the error? I cannot test the compile since we do not have the table definitions. – RBarryYoung May 31 '13 at 14:25
  • @Blam Why would you have to add wordID to the first constraint? According to your post, `(sID, wordPos)` ought to be sufficient as they are the primary key for the only table that I am drawing from in the `INSERT..SELECT..`. (In fact, now that I look at it, I realize that the `DISTINCT` is redundant and shouldn't be there) – RBarryYoung May 31 '13 at 14:29
  • All I know is I was getting a PK violation. It confused me also as sID and wordPos are the PK on the table. The error is Msg 208, Level 16, State 1, Line 40 Invalid object name 'cteWords'. – paparazzo May 31 '13 at 14:30
  • @Blam `"Invalid object name 'cteWords'"` makes no sense either. There's something not right here. – RBarryYoung May 31 '13 at 14:34
  • OK, I have copied your just posted Table Defs and I am getting it too. Give me a minute to figure out why ... – RBarryYoung May 31 '13 at 14:44
  • @Blam OK, I found it. The last query should have been referencing `FTSWordDef` and not `cteWords`, which was only defined for the earlier `INSERT`. I have corrected this in my post so please try them now. thanks – RBarryYoung May 31 '13 at 14:51
  • You worID and wordPos out of order so it was getting a duplicate from the wrong column. The first ran for three minutes and I gave up. The second was a little faster than Nicola. I need to verify all three give the same results (not just the same count). Thanks – paparazzo May 31 '13 at 15:55