How to efficiently add a column corresponding to a minimum value in a diffrent column to a view?

Question

Problem outline

I've got a performance problem with a view that is to be expanded to include a value associated to a 'smallest/greatest' value. Doing so more than halves the performance even with my smaller test dataset. The real database/query contains many more tables and joins, but these are all either fairly small tables, or simple FK joins. I've reduced the problem down to what I think is a minimal example, but may have inadvertently made more solutions possible that aren't practical with the real case. Please assume in any solution there are an additional ~10-20 ish tables and joins involved, though all are of the simple kind; e.g. lookup tables for various values, like the names of vehicles and such (the real db is more normalized). For example the 'City' value is not actually in Deliveries, but two JOINs removed, but those JOINS are 1:1, they can only ever return a single-row per source row. So any solution that completely upends the underlying query and thus makes it much more difficult to expand it with a simple join is not a great solution. Assume the following (example!) database tables. All number tables are INTs, all 'ID' columns have a PK, all 'fkX' are a foreign key relation to the named table.

Example Data

Table: Deliveries

+----+------------+---------+--------+
| ID |    City    | fkRoute | Number |
+----+------------+---------+--------+
|  1 | New York   |       1 |      1 |
|  2 | Boston     |       1 |      2 |
|  3 | Baltimore  |       2 |      1 |
|  4 | Baltimore  |       2 |      1 |
|  5 | Washington |       2 |      2 |
|  6 | Washington |       3 |      1 |
|  7 | Newark     |       3 |      1 |
|  8 | Portland   |       3 |      2 |
|  9 | Yonkers    |       3 |      3 |
+----+------------+---------+--------+

Table: [Routes]

+----+------------+---------+
| ID |    Date    | Vehicle |
+----+------------+---------+
|  1 | 2021-07-07 | Truck A |
|  2 | 2021-07-07 | Lorry B |
|  3 | 2021-07-08 | Truck A |
+----+------------+---------+

Table: Products

+----+------------+-------+
| ID | fkDelivery | Name  |
+----+------------+-------+
|  1 |          1 | W597  |
|  2 |          2 | W597  |
|  3 |          3 | T224  |
|  4 |          4 | DW F5 |
|  5 |          4 | W587  |
|  6 |          5 | W597  |
|  7 |          5 | W613  |
|  8 |          6 | DW F8 |
|  9 |          7 | T224  |
| 10 |          8 | W597  |
| 11 |          8 | DW-F8 |
| 12 |          8 | AS71  |
| 13 |          9 | T224  |
+----+------------+-------+

The problematic Query

Next, I have this current view (made using just some simple joins, a GROUP BY on [Routes].ID, and some aggregate 'COUNT()' to get the numbers). Such a query would look like:

SELECT MIN([Routes].Date) AS Date, MIN([Routes].Vehicle) AS Vehicle, 
    COUNT(Products.Name) AS nP, COUNT(DISTINCT Deliveries.ID) AS nD 
FROM [Routes] 
INNER JOIN Deliveries ON Deliveries.fkRoute = [Routes].ID 
INNER JOIN Products ON Products.fkDelivery = Deliveries.ID
GROUP BY [Routes].ID

It measures number of stuff delivered, and number of locations to deliver to.

Current View:

+------------+---------+----+----+
|    Date    | Vehicle | nP | nD |
+------------+---------+----+----+
| 2021-07-07 | Truck A |  2 |  2 |
| 2021-07-07 | Lorry B |  5 |  3 |
| 2021-07-08 | Truck A |  6 |  4 |
+------------+---------+----+----+

A Typical Query that uses this view is:

SELECT * FROM routeView WHERE '2021-07-07 00:00:00' <= [Date] 
    AND '2021-07-08 23:59:59' >= [Date]

Now this view is to be expanded. With a column that is proving to be a nasty thorn to program efficiently.

New View:

+------------+---------+----+----+-----------+
|    Date    | Vehicle | nP | nD | FirstCity |
+------------+---------+----+----+-----------+
| 2021-07-07 | Truck A |  2 |  2 | New York  |
| 2021-07-07 | Lorry B |  5 |  3 | Baltimore |
| 2021-07-08 | Truck A |  6 |  4 | Newark    |
+------------+---------+----+----+-----------+

FirstCity is the City corresponding to the 'lowest' Number. This can be ambiguous (users don't number their [Routes] consistently) by multiple cities having the lowest number; in that case it should just pick the first row. Numbers may include zero, but not negative values. My initial idea is this:

SELECT MIN([Routes].Date) AS Date, MIN([Routes].Vehicle) AS Vehicle, 
    COUNT(Products.Name) AS nP, COUNT(DISTINCT Deliveries.ID) AS nD, 
    MIN(CASE WHEN Deliveries.Number = minTable.minDN THEN Deliveries.City END) AS FirstCity
FROM [Routes] 
INNER JOIN Deliveries ON Deliveries.fkRoute = [Routes].ID 
INNER JOIN Products ON Products.fkDelivery = Deliveries.ID
LEFT JOIN (SELECT MIN(D2.Number) AS minDN, fkRoute
    FROM Deliveries AS D2 GROUP BY fkRoute) AS minTable ON [Routes].ID = minTable.fkRoute 
GROUP BY [Routes].ID

This query returns the view above. It makes the arbitrary choice to return the 'alphabetically' first city when there's ambiguous data (there's no requirement there). It works... but not very fast. With the example data it's no problem, but with a real data set (of say some millions of rows), the main problem is with the sub-query. It scans/fetches the whole 'Deliveries' table. But it really only needs to look at the deliveries between 07-07 and 07-08 (with the example query), so it's doing lots of redundant work. I've been trying to adapt some of the solutions in 'SQL antipatterns', ch. 15, but without much success; I either get syntax errors for some of them, or they mess up the 'nP'/'nD' data as the number of rows returned from the primary 'base' query is changed through WHERE conditions.

What would be a good way to achieve the desired speedup?

Note: If it were just a query then I could add the 'date' condition to both the sub-query and the primary query as a trivial solution. In that case the index is used on both queries. Yet this method doesn't seem to work for a view.

To reproduce:

Full T-SQL code to re-create the test database below:

USE [testDB]
GO

SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[[Routes]](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Date] [date] NOT NULL,
[Vehicle] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_[Routes]] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

CREATE TABLE [dbo].[Deliveries](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [City] [nvarchar](50) NOT NULL,
    [fkRoute] [int] NOT NULL,
    [Number] [int] NOT NULL,
 CONSTRAINT [PK_Deliveries] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[Deliveries]  WITH CHECK ADD  CONSTRAINT [FK_Deliveries_[Routes]] FOREIGN KEY([fkRoute])
REFERENCES [dbo].[[Routes]] ([ID])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[Deliveries] CHECK CONSTRAINT [FK_Deliveries_[Routes]]
GO

CREATE TABLE [dbo].[Products](
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [fkDelivery] [int] NOT NULL,
    [Name] [nvarchar](50) NOT NULL,
 CONSTRAINT [PK_Products] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

GO

ALTER TABLE [dbo].[Products]  WITH CHECK ADD  CONSTRAINT [FK_Products_Deliveries] FOREIGN KEY([fkDelivery])
REFERENCES [dbo].[Deliveries] ([ID])
ON UPDATE CASCADE
ON DELETE CASCADE
GO

ALTER TABLE [dbo].[Products] CHECK CONSTRAINT [FK_Products_Deliveries]
GO

INSERT INTO [[Routes]] (Date, Vehicle) VALUES ('2021-07-07', 'Truck A'), ('2021-07-07', 'Lorry B'), ('2021-07-08', 'Truck A')
GO

INSERT INTO Deliveries (City, fkRoute, Number) VALUES 
    ('New York', 1, 1),
    ('Boston', 1, 2),
    ('Baltimore', 2, 1),
    ('Baltimore', 2, 1),
    ('Washington', 2, 2),
    ('Washington', 3, 1),
    ('Newark', 3, 1),
    ('Portland', 3, 2),
    ('Yonkers', 3, 3)
GO
    
INSERT INTO [Products] (fkDelivery, Name) VALUES 
    ('1', 'W597'),
    ('2', 'W597'),
    ('3', 'T224'),
    ('4', 'DW F5'),
    ('4', 'W587'),
    ('5', 'W597'),
    ('5', 'W613'),
    ('6', 'DW F8'),
    ('7', 'T224'),
    ('8', 'W597'),
    ('8', 'DW-F8'),
    ('8', 'AS71'),
    ('9', 'T224')
GO

CREATE NONCLUSTERED INDEX [inRoutes_Date] ON [dbo].[Routes]
(
    [Date] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)

GO

We need a [mre] emphasis on *minimal*. – Dale K Jul 29 '21 at 10:11 — Dale K, Jul 29 '21 at 10:11

aphid · Accepted Answer · 2021-07-30T08:31:57.470

It turns out that in this particular case, there's a big difference in performance between the two main techniques employed in T-SQL in selecting functionally dependent columns. You would think that:

SELECT MAX(depCol), ID FROM myTable GROUP BY ID

And:

SELECT depCol, ID FROM myTable GROUP BY ID, depCol

Are functionally the same thus very similar in performance, but here it turns out there's an on the order of 10x discrepancy between the two if the Deliveries.Number column is left out. I do not know yet why or how this happens: Perhaps someone with far more understanding of the details/low-level stuff can enlighten me and improve this answer¹, but I could not observe any major change in the query plan that would suggest this using SSMS's query plan analyzer. The original solution (with the real data) has this performance for a sample query;

SQL Server Execution Times:
CPU time = 484 ms,  elapsed time = 628 ms.

Replacing the suggested solution query with:

SELECT [Routes].[Date] AS Date, [Routes].[Vehicle] AS Vehicle, 
    COUNT(Products.Name) AS nP, COUNT(DISTINCT Deliveries.ID) AS nD, 
    MIN(CASE WHEN Deliveries.Number = minTable.minDN THEN Deliveries.City END) AS FirstCity
FROM [Routes] 
INNER JOIN Deliveries ON Deliveries.fkRoute = [Routes].ID 
INNER JOIN Products ON Products.fkDelivery = Deliveries.ID
LEFT JOIN (SELECT MIN(D2.Number) AS minDN, fkRoute
    FROM Deliveries AS D2 GROUP BY fkRoute) AS minTable ON [Routes].ID = minTable.fkRoute 
GROUP BY [Routes].[ID], [Routes].[Vehicle], [Routes].[Date]

The result is now;

SQL Server parse and compile time: 
CPU time = 47 ms,  elapsed time = 207 ms.

(And so on and so forth for the 20-odd additional columns in the real scenario) gets most of the speedup. Since Deliveries.Number is ambiguous, the MIN() clause is necessary here to not get multiple rows returned for the same value of Routes.ID, but the result is still much better than before.

If it is really necessary to squeeze out every last drop, the Scalar-aggregate reduction method suggested in another answer applied to the last remaining MIN()-using column can get even more performance, although it's harder to maintain and understand.

Edit:

1: I have perhaps one little clue to the mystery. The MIN() function could be doing some sorting. If it uses some variant of Quicksort in the background, then the worst case performance may be that of an already sorted set, or a set where all the numbers are the same. And let that be our particular case: because of the JOINs and COUNTs, the result set, before being squashed, is much bigger (one row for every combination of product and delivery in a route), and then some O(n²) algorithm is applied to that.

How to efficiently add a column corresponding to a minimum value in a diffrent column to a view?

1 Answers1