Background
I am working in a very restricted T-SQL environment, where one may only define the "body" of a VIEW
: presumably the ...
in
CREATE VIEW My_View AS ...
under the hood. Here is my @@VERSION
:
Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64)
Jan 27 2023 16:44:09
Copyright (C) 2019 Microsoft Corporation
Web Edition (64-bit) on Linux (Amazon Linux 2)
I should note that these tables are synced from flat files, and as such, no formal schematic structure is preserved from the original source. That is, all "functional dependencies" are merely inferred (albeit reliably) from the column names and business concepts.
Problem
Suppose I have the following table called My_Measures
...
Person_ID | Name | Measure |
---|---|---|
1 | Greg | 0 |
1 | Greg | 10 |
2 | Devon | 20 |
2 | Devon | 30 |
...where Name
is functionally dependent on Person_ID
.
The Usual
Now suppose I wish to aggregate the Measure
into various summary statistics for each person. This is simple enough in SQL...
SELECT
Person_ID,
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
...and yields the following results:
Person_ID | Min_Measure | Max_Measure | Avg_Measure |
---|---|---|---|
1 | 0 | 10 | 5 |
2 | 20 | 30 | 25 |
The Twist
But suppose I wish to include Name
alongside each Person_ID
, like so:
Person_ID | Name | Min_Measure | Max_Measure | Avg_Measure |
---|---|---|---|---|
1 | Greg | 0 | 10 | 5 |
2 | Devon | 20 | 30 | 25 |
Obviously, the following attempt...
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄
Name,
-- ^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
...will fail with this error:
Column 'My_Measures.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Attempts
I have found several unsatisfactory ways to yield the intended output.
(1) GROUP BY
Dependent Variables
One way is to GROUP BY
the Name
column after Person_ID
; and more generally, to append the dependent variable(s) at the end of the GROUP BY
clause:
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄
Name,
-- ^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
-- ⌄⌄⌄⌄⌄⌄
Person_ID, Name
-- ^^^^^^
This leaves the grouping unaltered, since the "real" grouping variable(s) (here Person_ID
) have already defined it, and the dependent variables are merely "tagging along". However, this wastes performance on processing (arbitrarily many) dependent variables, which may be more complex (CHAR
strings for Name
) for indexing purposes.
(2) "Aggregate" Dependent Variables
Another way is to "aggregate" the Name
column, with some function (like MIN()
) that gives us one representative value (like 'Greg'
) from many identical duplicates (like ('Greg', 'Greg')
).
SELECT
Person_ID,
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
MIN(Name) AS Name,
-- ^^^^^^^^^^^^^^^^^^^^^
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
This likewise achieves the intended result, but it likewise wastes performance on computing an aggregation of many identical values. Furthermore, it will only work on values that are comparable and thus have a MIN()
; but it will obviously fail for non-comparable datatypes.
(3) ReJOIN
After Aggregating
Perhaps the most disappointing way is to simply calculate the aggregates, and then reassociate the Person_ID
with its Name
via a JOIN
:
-- Aggregate by ID.
WITH agg AS(
SELECT
Person_ID,
MIN(Measure) AS Min_Measure,
MAX(Measure) AS Max_Measure,
AVG(Measure) AS Avg_Measure
FROM
My_Measures
GROUP BY
Person_ID
-- Deduplicate names for the JOIN. Given functional dependency, DISTINCT suffices.
), msr AS (
SELECT DISTINCT
Person_ID,
Name
FROM My_Measures
-- Reassociate the names with their IDs.
) SELECT
agg.Person_ID AS Person_ID,
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
msr.Name AS Name,
-- ^^^^^^^^^^^^^^^^^^^^^^^^
agg.Min_Measure AS Min_Measure,
agg.Max_Measure AS Max_Measure,
agg.Avg_Measure AS Avg_Measure
FROM
-- ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
agg INNER JOIN msr
-- ^^^^^^^^^^^^^^
ON agg.Person_ID = msr.Person_ID
Obviously, this squanders significant resources on an unnecessary JOIN
and multiple CTEs, all to to recover data (like Name
) that we originally had!
(4) Take FIRST_VALUE()
Over a PARTITION
I have searched for some equivalent in T-SQL of the first()
function in R. In SQL, such a FIRST()
would simply select the very first value ('Greg'
) from many identical duplicates (('Greg', 'Greg')
) within a GROUP
, without requiring any costly computations. Furthermore, this would work regardless of comparability.
I have stumbled across the FIRST_VALUE()
function, but that seems to require a PARTITION
for each usage, and — in my relative inexperience with optimizing PARTITION
s — I worry about the impact on performance, if many dependent variables must be SELECT
ed.
It also looks ugly. ¯\(ツ)/¯
Question
What is the best way to SELECT
an arbitrary set of dependent variables (like Name
) alongside the grouping variables (like Person_ID
)? Please prioritize performance, but also consider elegance and canonicity and finally extensibility: this should ideally work for all datatypes, even the non-comparable.