How do I optimize this SQL query comparing a value in two possible columns?

Question

I've written a SQL query that uses six tables to construct the output for a separate C# program, and I'm looking for a way to speed up the search.

I walked through the execution plan and I notice that one spot in particular is taking up 85% of the execution time, labeled with a comment in the code block below with --This spot right here.

select distinct 
    ta.account_num as 'Account', 
    tl.billing_year as 'Year', 
    tl.billing_month as 'Month', 
    ta.bill_date as 'Bill Date', 
    DATEDIFF(DD, cast(cast(tl.billing_year as varchar(4)) + right('0' + cast(tl.billing_month as varchar(2)), 2) + right('0' + (case when billing_month in (4,6,9,11) and bill_date > 30 then '30' when billing_month = 2 and bill_date > 28 then '28' else cast(bill_date as varchar(2)) end), 2) as datetime), GETDATE()) as 'Past', 
    DATEADD(Day,10,d) as 'To be Loaded Before',
    p.provider_name as 'Provider', 
    c.client as 'Client',
    tip.invoice_load_type as 'Load Type'
from 
    tm_invoice_load tl
inner join 
    tm_client c on tl.client_id = c.client_id
inner join 
    tm_client_account ta on (ta.account_num = tl.account_num or ta.pilot = tl.account_num) --This spot right here
inner join 
    provider p on p.id_provider = ta.id_provider
inner join 
    tm_calendar cal on DATEPART(DAY, d) = DATEPART(DAY, entry_dt)
                    and DATEPART(MONTH, d) = DATEPART(MONTH, entry_dt)
                    and DATEPART(YEAR, d) = DATEPART(YEAR, entry_dt)
inner join 
    tm_invoice_load_type tip on tip.invoice_load_type_id = ta.invoice_load_type_id
where
    not exists (select top 1 id 
                from tm_invoice_load 
                where billing_year = tl.billing_year 
                  and billing_month = tl.billing_month 
                  and status_id = 1 
                  and (account_num = ta.account_num or account_num = ta.pilot))
    and ta.status_id = 1
    --and ta.invoice_load_type_id = 2
    and tl.status_id = 2
    and (ta.pilot is null or ta.account_num <> ta.pilot)
order by 
    c.client, p.provider_name, ta.account_num, tl.billing_year, tl.billing_month

Above, it's when joining tm_client_account, where it has an account number column, and a pilot in case it is a child to another account. When such a thing happens, the parent account is NOT selected (ta.pilot is null or ta.account_num <> ta.pilot), and instead the child accounts are shown.

The query works exactly as intended, but it's kinda slow, and as these tables grow (and they are doing so on a nearly exponential curve) it will only get worse.

Is there some way that I can accomplish this join in a faster way? Even small gains would be great!

If it helps, I'm running this on SQL Server 2008 R2.

Here is a screenshot of the execution plan. If needed, I can provide more/different information.

For perfomance question you have to include [QUERY PLAN](http://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan) — Juan Carlos Oropeza, Jan 21 '16 at 21:09
@JuanCarlosOropeza Is there a particular way to best present it here? Based on the post you linked to, there are more than a few, and I wonder which is preferred here, as I have not previously asked any such questions... — Nicolas Coombs, Jan 25 '16 at 16:06
Regarding the query-plan, ideally you run the query in SqlSentry Plan Explorer (free is fine), then do File/Save as.. and save it as a .queryanalyis file. Zip and share it somewhere. If that's not an option you can do the same in SSMS; right click the Query Plan and choose save as.... Works too but retains less info. Anyway, from the screenshot it seems there is one particular JOIN that eats 85% of the cost, check the pop-up to get more info when you hoover over it. — deroby, Jan 26 '16 at 20:46
@deroby I actually put a comment in the code block indicating which `join` is causing the problem (`tm_client_account ta on (ta.account_num = tl.account_num or ta.pilot = tl.account_num)`)... I got that far, but I wasn't sure how to change it in a way that would be more efficient... — Nicolas Coombs, Jan 27 '16 at 03:53

deroby · Answer 1 · 2016-01-27T11:18:17.947

I don't think there is anything really wrong with the query. I tend to keep things like ta.status_id = 1 inside the relevant (ta) JOIN clause rather than in the WHERE, but the query optimizer is smart enough to handle that.

The one thing that I'd suggest to change is this one:

inner join 
tm_calendar cal on DATEPART(DAY, d) = DATEPART(DAY, entry_dt)
                and DATEPART(MONTH, d) = DATEPART(MONTH, entry_dt)
                and DATEPART(YEAR, d) = DATEPART(YEAR, entry_dt)

Try to replace it with this and see what happens:

inner join 
tm_calendar cal on Convert(date, d) = Convert(date, entry_dt)

The result should be identical but this way the system can still use the index (and statistics) on the d and/or entry_dt fields. Convert(date,...) is one of the few SARG-able converts there are.

Apart from that it all depends on the indexes available, the amount and distribution of the data.

UPDATE: as you mention that the heavy cost (85%) seems to come from the ta part, here some thoughts

I've taken the liberty of reformatting the query a bit so it makes more sense to me. Mainly I've grouped the relevant parts together so it's more clear (to me) what's being done on what table.

SELECT DISTINCT 
       ta.account_num as 'Account', 
       tl.billing_year as 'Year', 
       tl.billing_month as 'Month', 
       ta.bill_date as 'Bill Date', 
       DATEDIFF(DD, cast(cast(tl.billing_year as varchar(4)) + right('0' + cast(tl.billing_month as varchar(2)), 2) + right('0' + (case when billing_month in (4,6,9,11) and bill_date > 30 then '30' when billing_month = 2 and bill_date > 28 then '28' else cast(bill_date as varchar(2)) end), 2) as datetime), GETDATE()) as 'Past', 
       DATEADD(Day,10,d) as 'To be Loaded Before',
       p.provider_name as 'Provider', 
       c.client as 'Client',
       tip.invoice_load_type as 'Load Type'
  FROM tm_invoice_load tl

  JOIN tm_client c 
    ON tl.client_id = c.client_id

  JOIN tm_client_account ta 
    ON (ta.account_num = tl.account_num or ta.pilot = tl.account_num) --This spot right here
   AND ta.status_id = 1
   AND (ta.pilot IS NULL OR ta.account_num <> ta.pilot)
 --AND ta.invoice_load_type_id = 2

  JOIN provider p 
    ON p.id_provider = ta.id_provider

  JOIN tm_invoice_load_type tip 
    ON tip.invoice_load_type_id = ta.invoice_load_type_id

  JOIN tm_calendar cal 
    ON Convert(date, cal.d) = Convert(date, tl.entry_dt)

 WHERE tl.status_id = 2
   AND NOT EXISTS (SELECT *
                     FROM tm_invoice_load xx
                    WHERE xx.billing_year = tl.billing_year 
                      AND xx.billing_month = tl.billing_month 
                      AND xx.status_id = 1 
                      AND (xx.account_num = ta.account_num OR xx.account_num = ta.pilot))

  ORDER BY c.client, p.provider_name, ta.account_num, tl.billing_year, tl.billing_month

Could you give the query as it is here a try and time it + check the query plan?

Focussing on tm_client_account the next logical step would be to add a specific index on this table for this query. The problem is that the filtered index capabilities of MSSQL are ... well, challenging =)

The closest I get is this:

CREATE INDEX idx_test ON tm_client_account (account_num) INCLUDE (pilot, bill_date, id_provider, invoice_load_type_id) WHERE (status = 1)

Could you create the index and see if that helps? If the query-plan shows that rather continues using a table-scan on tm_client_account then add WITH (INDEX = idx_test) to the JOIN clause and try again.

If all that doesn't really cut down on the time the query takes you could try pre-filtering things by means of an indexed view. WARNING: Keep in mind though that adding indexes to the table will cause some performance degradation when doing INSERT/UPDATE/DELETE on the table; adding a filtered view on it will double-so! If the data in this table (very) volatile, the indexes and/or view might make other parts of your system noticably slower. Then again, SQL is pretty good at this so the only way to know for sure is by testing.

Step 1: create the view

CREATE VIEW v_test_tm_client_account_filtered
WITH SCHEMABINDING
AS
SELECT id_client_account, -- I am assuming that there is an identity-field like this on the table, adapt as needed!
       account_num, 
       pilot,
       bill_date, 
       id_provider,
       invoice_load_type_id
  FROM tm_client_account ta
 WHERE status = 1
   AND ta.status_id = 1
   AND (ta.pilot IS NULL OR ta.account_num <> ta.pilot)
 --AND ta.invoice_load_type_id = 2

Step 2: index the view

CREATE UNIQUE CLUSTERED INDEX idx_test ON v_test_tm_client_account_filtered (account_num, pilot, id_client_account)

Step 3: adapt the query to use the view rather than the table directly

  JOIN tm_client_account ta 
    ON (ta.account_num = tl.account_num or ta.pilot = tl.account_num) --This spot right here
   AND ta.status_id = 1
   AND (ta.pilot IS NULL OR ta.account_num <> ta.pilot)
 --AND ta.invoice_load_type_id = 2

becomes

  JOIN v_test_tm_client_account_filtered ta 
    ON ta.account_num = tl.account_num 
    OR ta.pilot       = tl.account_num

And then run the query once more...

After finally having enough time to test all proposed adjustments, I was very surprised to find that the original query was still faster than anything here (in fairness, I suspect total server load was the largest factor, bigger than anything in these queries). The difference was often less than 200ms over a 7 second span. Strangely, the reorganizing you did for readability - which should have had no impact - added almost a consistent second to the query time. I used 4 versions of this query for testing to compare options, and when there was a difference, the original was still the fastest. — Nicolas Coombs, Feb 02 '16 at 15:43
In fairness, the first index (on tm_client_account) was in place for all testing, but I noticed that by itself it did very little (a consistent gain of less than 200ms). — Nicolas Coombs, Feb 02 '16 at 15:46
RDBMSs can be very much alike black magic sometimes =) If you really need to squeze down those 7 seconds I'd suggest to give QueryPlan Explorer a go and see if it shows anything worth tweaking further. Another option might be to JOIN only some tables first, dump the result in a temp-table, maybe index said temp-table and then JOIN all the rest. Judging which tables to 'pre-JOIN' is off course tricky. This largely depends on how well you know the data at hand and how well you can estimate how things will inter-connect. And whether this will also be the case for other load data. Good luck! — deroby, Feb 03 '16 at 10:17

score 0 · Answer 2 · answered Jun 28 '16 at 19:50

In case someone comes across this with a similar issue, the only way I've been able to make this any faster has been to tell the query not to lock the data. I don't know why this helped, but it cut the query time in less than half. This has had no effect on the returned data, and explicitly sorting it makes it unapparent.

select distinct 
    ta.account_num as 'Account', 
    tl.billing_year as 'Year', 
    tl.billing_month as 'Month', 
    ta.bill_date as 'Bill Date', 
    DATEDIFF(DD, cast(cast(tl.billing_year as varchar(4)) + right('0' + cast(tl.billing_month as varchar(2)), 2) + right('0' + (case when billing_month in (4,6,9,11) and bill_date > 30 then '30' when billing_month = 2 and bill_date > 28 then '28' else cast(bill_date as varchar(2)) end), 2) as datetime), GETDATE()) as 'Past', 
    DATEADD(Day,10,d) as 'To be Loaded Before',
    p.provider_name as 'Provider', 
    c.client as 'Client',
    tip.invoice_load_type as 'Load Type'
from 
    tm_invoice_load tl

with (nolock) --This accelerates the select statement!

inner join 
    tm_client c on tl.client_id = c.client_id
[...]

Meanwhile a computed column was added, and this query is still down from 7-8 seconds to 2.5-3.5 seconds. Again, I don't know why, but it helped. It might help someone else too.

How do I optimize this SQL query comparing a value in two possible columns?

2 Answers2