2

I am working in an environment where I have a view, [stages].[jobStages], that contains the JobNumber, Region, and the the dates it completes each stage of the job. From this view I have created another view, [forecast].[DurationTable], containing the time between the stages {i.e. Duration1 = Stage2 - Stage1}. Example below:

JobNumber Region Stage1 Stage2 Stage3 Duration1 Duration2
12345 Houston 3/22/2022 4/18/2022 5/25/2022 27 37
42421 Houston 10/09/2021 11/29/2021 1/6/2022 51 38
98984 Houston 1/1/2022 2/2/2022 3/3/2022 32 29
78900 Dallas 4/23/2022 5/17/2022 5/29/2022 24 12

Now I need to create another view that finds the average time between each stage, by region, only using instances of that stage that have been completed within the past 4 months. An example of what this looks like (assuming today's date to be June 1st 2022):

JobNumber Region Stage1 Stage2 Stage3 Duration1 Duration2 AvgDuration1 AvgDuration2
12345 Houston 3/22/2022 4/18/2022 5/25/2022 27 37 29.5 33
42421 Houston 10/09/2021 11/29/2021 1/6/2022 51 38 29.5 33
98984 Houston 1/1/2022 2/2/2022 3/3/2022 32 29 29.5 33
78900 Dallas 4/23/2022 5/17/2022 5/29/2022 24 12 24 12

In the above example AvgDuration1 for the Houston jobs only considers the first and third record as Stage2 was completed within the last 4 months of todays date (June 1st) on these records so it averages Duration1 of 27 and 32.

I have created a query resembling the one below however it times out before execution. My actual query must AVG() 8 times as there are more stages than this example.

SELECT [JobNumber]
,[Region]
,[Stage1]
,[Stage2]
,[Stage3]
,[Duration1]
,[Duration2]
,(
    SELECT AVG(Duration1)
    FROM [forecast].[DurationTable]
    WHERE DATEDIFF(month, Stage2, GETDATE()) <= 4
    GROUP BY Region
) AS AvgDuration1
,(
    SELECT AVG(Duration2)
    FROM [forecast].[DurationTable]
    WHERE DATEDIFF(month, Stage3, GETDATE()) <= 4
    GROUP BY Region
) AS AvgDuration2
FROM [forecast].[DurationTable]

I am not very experienced with SQL but imagine that calling the [forecast].[DurationTable] and performing a calculation so many times is load intensive. I have tried forming [DurationTable] as a CTE but this did not increase performance. How can I achieve my result table in a reasonable amount of time (sub 15 minutes)?

  • 1
    Without commenting on the overall query structure, Aaron Bertrand's answer [here](https://stackoverflow.com/a/10854024/7165279) explains how rearranging the `datediff` expression in your `where` clause *could* make a big difference. – allmhuran Jul 22 '22 at 15:23
  • It didn't help much in this situation but was an interesting read that I'm glad to be aware of. Thanks! – JB_DataScientist Jul 22 '22 at 17:22
  • For performance questions, we need at a minimum the tables and index definitions, and please share the query plan via https://pastetheplan.com. Without this info it's impossible to answer properly – Charlieface Jul 22 '22 at 18:39

1 Answers1

2

Your current query is invalid due to the GROUP BY Region as it would raise an error if more than one region exists - I assume that there is supposed to be a correlation on Region also.

An improvement is likely to be

SELECT [JobNumber],
       [Region],
       [Stage1],
       [Stage2],
       [Stage3],
       [Duration1],
       [Duration2],
       AvgDuration1 = AVG(CASE WHEN DATEDIFF(month, Stage2, GETDATE()) <= 4 THEN Duration1 END) OVER (PARTITION BY Region),
       AvgDuration2 = AVG(CASE WHEN DATEDIFF(month, Stage3, GETDATE()) <= 4 THEN Duration2 END) OVER (PARTITION BY Region)
FROM   [forecast].[DurationTable] 

Building views on top of views is often problematic though.

It is unclear how these Stage and Duration columns in the view are derived but you may end up with something much better performing if you write your queries against the base tables rather than the view.

Martin Smith
  • 438,706
  • 87
  • 741
  • 845
  • I noticed the group by would break it as soon as I added it in here but thought it at least got my point across so left it in. I just used your version and it appears to not only work but much faster than expected (while referencing the view). I still need to check that the results are correct for what I'm expecting but will keep in mind that I could potentially un-nest the views. – JB_DataScientist Jul 22 '22 at 17:21
  • @JB_DataScientist For the above query to work efficiently, you need the following index `DurationTable (Region) INCLUDE (JobNumber, Region, Stage1, Stage2, Stage3, Duration1, Duration2)` – Charlieface Jul 22 '22 at 18:40