2

Let's say I have the following two tables:

CREATE TABLE [dbo].[ActionTable] 
(
    [ActionID] [int] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
    ,[ActionName] [varchar](80) NOT NULL
    ,[Description] [varchar](120) NOT NULL
    ,CONSTRAINT [PK_ActionTable] PRIMARY KEY CLUSTERED ([ActionID] ASC)
    ,CONSTRAINT [IX_ActionName] UNIQUE NONCLUSTERED ([ActionName] ASC)
    )
GO

CREATE TABLE [dbo].[BigTimeSeriesTable] 
(
    [ID] [bigint] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
    ,[TimeStamp] [datetime] NOT NULL
    ,[ActionID] [int] NOT NULL
    ,[Details] [varchar](max) NULL
    ,CONSTRAINT [PK_BigTimeSeriesTable] PRIMARY KEY NONCLUSTERED ([ID] ASC)
    )
GO

ALTER TABLE [dbo].[BigTimeSeriesTable]
    WITH CHECK ADD CONSTRAINT [FK_BigTimeSeriesTable_ActionTable] FOREIGN KEY ([ActionID]) REFERENCES [dbo].[ActionTable]([ActionID])
GO

CREATE CLUSTERED INDEX [IX_BigTimeSeriesTable] ON [dbo].[BigTimeSeriesTable] ([TimeStamp] ASC)
GO

CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID] ON [dbo].[BigTimeSeriesTable] ([ActionID] ASC)
GO

ActionTable has 1000 rows and BigTimeSeriesTable has millions of rows.

Now consider the following two queries:

Query A

SELECT *
FROM BigTimeSeriesTable
WHERE TimeStamp > DATEADD(DAY, -3, GETDATE())
    AND ActionID IN (
        SELECT ActionID
        FROM ActionTable
        WHERE ActionName LIKE '%action%'
        )

Execution plan for query A

Query B

SELECT bts.*
FROM BigTimeSeriesTable bts
INNER JOIN ActionTable act ON act.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE())
    AND act.ActionName LIKE '%action%'

Execution plan for query B

Question: Why does query A have better performance than query B (sometimes 10 times better)? Shouldn't the query optimizer recognize that the two queries are exactly the same? Is there any way to provide hints that would improve the performance of the INNER JOIN?

Update: I changed the join to INNER MERGE JOIN and the performance greatly improved. See execution plan here. Interestingly when I try the merge join in the actual query I'm trying to run (which I cannot show here, confidential) it totally messes up the query optimizer and the query is super slow, not just relatively slow.

Mark O
  • 23
  • 4
  • 2
    Look at the execution plans of those queries. – juergen d Feb 09 '20 at 07:35
  • If you know the IN is faster then why do you want to use the JOIN? –  Feb 09 '20 at 07:35
  • If the inner query returns just a few results that is faster than joining two tables together. – juergen d Feb 09 '20 at 07:36
  • @a_horse_with_no_name I'm already joining the two in order to return the ActionName in my query. That's not clear from the question, but I was trying to keep the example clean. – Mark O Feb 09 '20 at 07:41
  • Can you post the example Execution Plans - it is hard to create a repro considering the number of rows. – Martin Cairney Feb 09 '20 at 08:07
  • @MartinCairney posted. – Mark O Feb 09 '20 at 08:35
  • @Asraf looks to have identified where the difference is coming in. If you rewrite the query as `SELECT bts.* FROM BigTimeSeriesTable bts INNER JOIN ActionTable act ON act.ActionID = bts.ActionID AND act.ActionName LIKE '%action%' WHERE bts.Timestamp > DATEADD(DAY, -3, GETDATE())` do you then achieve the same plan and performance? – Martin Cairney Feb 09 '20 at 09:11
  • @MartinCairney - Asraf's answer is entirely wrong – Martin Smith Feb 09 '20 at 15:11

3 Answers3

0

The execution plans you have supplied both have exactly the same basic strategy.

Join

enter image description here

There is a seek on ActionTable to find rows where ActionName starts with "generate" with a residual predicate on the ActionName LIKE '%action%'. The 7 matching rows are then used to build a hash table.

On the probe side there is a seek on TimeStamp > Scalar Operator(dateadd(day,(-3),getdate())) and matching rows are tested against the hash table to see if the rows should join.

There are two main differences which explain why the IN version executes quicker

IN

enter image description here

  1. The IN version is executing in parallel. There are 4 concurrent threads working on the query execution - not just one.
  2. Related to the parallelism this plan has a bitmap filter. It is able to use this bitmap to eliminate rows early. In the inner join plan 25,959,124 rows are passed to the probe side of the hash join, in the semi join plan the seek still reads 25.9 million rows but only 313 rows are passed out to be evaluated by the join. The remainder are eliminated early by applying the bitmap inside the seek.

It is not readily apparent why the INNER JOIN version does not execute in parallel. You could try adding the hint OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to see if you now get a plan which executes in parallel and contains the bitmap filter.

If you are able to change indexes then, given that the query only returns 309 rows for 7 distinct actions, you may well find that replacing IX_BigTimeSeriesTable_ActionID with a covering index with leading columns [ActionID], [TimeStamp] and then getting a nested loops plan with 7 seeks performs much better than your current queries.

CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID_TimeStamp]
  ON [dbo].[BigTimeSeriesTable] ([ActionID], [TimeStamp])
  INCLUDE ([Details], [ID])

Hopefully with that index in place your existing queries will just use it and you will see 7 seeks, each returning an average of 44 rows, to read and return only the exact 309 total required. If not you can try the below

SELECT CA.*
FROM ActionTable A
CROSS APPLY 
(
SELECT *
FROM BigTimeSeriesTable B
WHERE B.ActionID = A.ActionID AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
) CA
WHERE A.ActionName LIKE '%action%'
Community
  • 1
  • 1
Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • Part of what was so frustrating about this problem is that there was already an index on ActionID. I was trying to "materialize" the subquery in order to force that index to be used. I should have noticed that _none_ of the execution plans included that index, and getting parallelism to work was a red herring. There are two solutions here, a temp table or an index hint. The index hint is working great so far. Thanks for your answer, it was very helpful. – Mark O Feb 10 '20 at 04:08
  • @MarkO - if the index just has a key column of `ActionID` then SQL Server will still have to read all matching rows for an action, including those older than 3 days, with a residual predicate on `TimeStamp` to discard the unneeded ones. The optimal index will have key columns of `ActionID,TimeStamp` to allow exactly the right rows to be seeked – Martin Smith Feb 10 '20 at 08:06
  • Forcing IX_BigTimeSeriesTable_ActionID makes this particular query (and others like it) take a fraction of a second, even with a join and the big table having 200 million+ rows. Unfortunately updating indexes is not feasible in the near future so this will have to do. – Mark O Feb 10 '20 at 10:45
  • Also note that the table is clustered on `TimeStamp` so discarding unneeded rows on that predicate is not a big deal. – Mark O Feb 10 '20 at 10:48
  • @MarkO - Ah in that case you do actually have the ideal index already implicitly. For non unique non clustered indexes the clustered index key is appended to the key columns - so it is already `[ActionID], [TimeStamp]` even though only declared as `ActionID`. So each seek can do an equality on `ActionID` and then range seek on the secondary `TimeStamp` key – Martin Smith Feb 10 '20 at 10:54
0

I had some success using an index hint: WITH (INDEX(IX_BigTimeSeriesTable_ActionID))

However as the query changes, even slightly, this can totally hamstring the optimizer's ability to get the best query.

Therefore if you want to "materialize" a subquery in order to force it to execute earlier, your best bet as of February 2020 is to use a temp table.

Mark O
  • 23
  • 4
-1

For inner join there's no difference between filtering and joining

[Difference between filtering queries in JOIN and WHERE?

But here your codes create different cases

Query A: You are just filtering with 1000 record

Query B: You first join with millions of rows and then filter with 1000 records

So query A take less time than query B

  • What is done "first" is determined by the query plan. Not necessarily by the query you write. It's entirely possible (though unlikely) that Query A runs `WHERE ActionName LIKE '%action%'` after the `IN` – Nick.Mc Feb 09 '20 at 10:57