
Background

I am working in a very restricted T-SQL environment, where one may only define the "body" of a VIEW: presumably the ... in

CREATE VIEW My_View AS ...

under the hood. Here is my @@VERSION:

Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64)
        Jan 27 2023 16:44:09
        Copyright (C) 2019 Microsoft Corporation
        Web Edition (64-bit) on Linux (Amazon Linux 2)

I should note that these tables are synced from flat files, and as such, no formal schematic structure is preserved from the original source. That is, all "functional dependencies" are merely inferred (albeit reliably) from the column names and business concepts.

Problem

Suppose I have the following table called My_Measures...

Person_ID | Name  | Measure
----------|-------|--------
1         | Greg  | 0
1         | Greg  | 10
2         | Devon | 20
2         | Devon | 30

...where Name is functionally dependent on Person_ID.

The Usual

Now suppose I wish to aggregate the Measure into various summary statistics for each person. This is simple enough in SQL...

SELECT
    Person_ID,
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

...and yields the following results:

Person_ID | Min_Measure | Max_Measure | Avg_Measure
----------|-------------|-------------|------------
1         | 0           | 10          | 5
2         | 20          | 30          | 25

The Twist

But suppose I wish to include Name alongside each Person_ID, like so:

Person_ID | Name  | Min_Measure | Max_Measure | Avg_Measure
----------|-------|-------------|-------------|------------
1         | Greg  | 0           | 10          | 5
2         | Devon | 20          | 30          | 25

Obviously, the following attempt...

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄
    Name,
--  ^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

...will fail with this error:

Column 'My_Measures.Name' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Attempts

I have found several unsatisfactory ways to yield the intended output.

(1) GROUP BY Dependent Variables

One way is to GROUP BY the Name column after Person_ID; and more generally, to append the dependent variable(s) at the end of the GROUP BY clause:

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄
    Name,
--  ^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM
    My_Measures
GROUP BY
--           ⌄⌄⌄⌄⌄⌄
    Person_ID, Name
--           ^^^^^^

This leaves the grouping unaltered, since the "real" grouping variable(s) (here Person_ID) already define it, and the dependent variables are merely "tagging along". However, this wastes performance on processing (arbitrarily many) dependent variables, which may be datatypes (CHAR strings for Name) more expensive to compare and index than the grouping key itself.

(2) "Aggregate" Dependent Variables

Another way is to "aggregate" the Name column, with some function (like MIN()) that gives us one representative value (like 'Greg') from many identical duplicates (like ('Greg', 'Greg')).

SELECT
    Person_ID,
--  ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    MIN(Name)    AS Name,
--  ^^^^^^^^^^^^^^^^^^^^^
    MIN(Measure) AS Min_Measure,
    MAX(Measure) AS Max_Measure,
    AVG(Measure) AS Avg_Measure
FROM 
    My_Measures
GROUP BY 
    Person_ID

This also achieves the intended result, but it likewise wastes performance, here on computing an aggregate over many identical values. Furthermore, it only works on values that are comparable and thus have a MIN(); it will fail outright for non-comparable datatypes.
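
For instance, assuming a hypothetical bit column Is_Active (not in my sample table), the same trick breaks, since bit supports neither MIN() nor MAX():

-- Hypothetical column: suppose My_Measures also carried a bit flag
-- Is_Active, functionally dependent on Person_ID. SQL Server rejects
-- this with an error like "Operand data type bit is invalid for min operator."
SELECT
    Person_ID,
    MIN(Is_Active) AS Is_Active,
    MIN(Measure)   AS Min_Measure
FROM
    My_Measures
GROUP BY
    Person_ID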

(3) ReJOIN After Aggregating

Perhaps the most disappointing way is to simply calculate the aggregates, and then reassociate the Person_ID with its Name via a JOIN:

-- Aggregate by ID.
WITH agg AS(
    SELECT
        Person_ID,
        MIN(Measure) AS Min_Measure,
        MAX(Measure) AS Max_Measure,
        AVG(Measure) AS Avg_Measure
    FROM 
        My_Measures
    GROUP BY 
        Person_ID
    
-- Deduplicate names for the JOIN. Given functional dependency, DISTINCT suffices.
), msr AS (
    SELECT DISTINCT
        Person_ID,
        Name
    FROM My_Measures
    
-- Reassociate the names with their IDs.
) SELECT
    agg.Person_ID   AS Person_ID,
--  ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    msr.Name        AS Name,
--  ^^^^^^^^^^^^^^^^^^^^^^^^
    agg.Min_Measure AS Min_Measure,
    agg.Max_Measure AS Max_Measure,
    agg.Avg_Measure AS Avg_Measure
FROM
--      ⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄⌄
    agg INNER JOIN msr
--      ^^^^^^^^^^^^^^
    ON agg.Person_ID = msr.Person_ID

Obviously, this squanders significant resources on an unnecessary JOIN and multiple CTEs, all to recover data (like Name) that we originally had!

(4) Take FIRST_VALUE() Over a PARTITION

I have searched for some equivalent in T-SQL of the first() function in R. In SQL, such a FIRST() would simply select the very first value ('Greg') from many identical duplicates (('Greg', 'Greg')) within a GROUP, without requiring any costly computations. Furthermore, this would work regardless of comparability.

I have stumbled across the FIRST_VALUE() function, but that seems to require its own OVER (PARTITION BY ...) clause for each usage; in my relative inexperience with optimizing PARTITIONs, I worry about the impact on performance if many dependent variables must be SELECTed.
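
To illustrate, here is a rough sketch of how that might look. This is my own untested guess: it abandons GROUP BY in favor of SELECT DISTINCT over window functions, and it supplies an ORDER BY inside the first OVER () clause because FIRST_VALUE() requires one:

-- Sketch: every column gets its own window, and DISTINCT collapses
-- the duplicated per-row results back down to one row per person.
SELECT DISTINCT
    Person_ID,
    FIRST_VALUE(Name) OVER (PARTITION BY Person_ID ORDER BY Person_ID) AS Name,
    MIN(Measure)      OVER (PARTITION BY Person_ID) AS Min_Measure,
    MAX(Measure)      OVER (PARTITION BY Person_ID) AS Max_Measure,
    AVG(Measure)      OVER (PARTITION BY Person_ID) AS Avg_Measure
FROM
    My_Measures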

It also looks ugly. ¯\_(ツ)_/¯

Question

What is the best way to SELECT an arbitrary set of dependent variables (like Name) alongside the grouping variables (like Person_ID)? Please prioritize performance, but also consider elegance, canonicity, and finally extensibility: this should ideally work for all datatypes, even non-comparable ones.

Greg
    You honestly have the solutions; either put a `MAX`/`MIN` around the column or add it to your `GROUP BY`. As for what's more performant, *you* are in the best position to test this on your environment to see what's faster and how the execution plans differ. – Thom A Jun 28 '23 at 16:00
    Are you sure that you are using `tsql`? `sql_mode=only_full_group_by` looks like a MySQL thing. – GMB Jun 28 '23 at 16:00
    `As such, I cannot imitate this answer in disabling the setting sql_mode=only_full_group_by` - how would you begin doing that given that it is for mysql and you are allegedly using SQL Server? – GSerg Jun 28 '23 at 16:00
  • @GMB Correct, I am using TSQL. I was including `sql_mode=only_full_group_by` as an example of what I _cannot_ do. – Greg Jun 28 '23 at 16:01
    As for the "Group by dependent variables": *However, this wastes performance on processing (arbitrarily many) dependent variables*. How do you know that? Did you actually test the solution and faced inefficiency? – GMB Jun 28 '23 at 16:04
  • @TheImpaler I am definitely using **T-SQL**. I included the `sql_mode=...` as a solution in MySQL which I _cannot_ use in my situation. – Greg Jun 28 '23 at 16:06
I think that your statement of *"As such, I cannot imitate [this answer](https://stackoverflow.com/a/45945844) in disabling the MySQL setting `sql_mode=only_full_group_by`."* is misleading people, @Greg. You can't use the solution because it's not relevant at all; it's for a *completely* different technology. It would be like saying "I can't use this solution in [How do I foo in C#?]" when you're asking how to foo in Java. The statement only adds confusion because *of course* you can't use a solution for MySQL for SQL Server/Sybase; they *aren't* MySQL. – Thom A Jun 28 '23 at 16:09
  • SQL Server does not implement functional dependency, as defined in the SQL Standard, so `NAME` cannot show up unaggregated as you want. According to the SQL Standard this should be possible, nevertheless. I know PostgreSQL does implement this to some extent (and in a correct way, unlike MySQL), so that column could be placed there "as is". – The Impaler Jun 28 '23 at 16:10
  • @ThomA I suppose I was trying to avoid suspicion of a duplicate question with an existing solution. Shall I simply delete that line and replace it with my results for `@@VERSION` ? – Greg Jun 28 '23 at 16:12
Now, whether the column is comparable or not, you could use `MIN()` or `FIRST_VALUE()` or other. If you want to prioritize performance, then the only way of finding out which one is better is to "run your horses". Try them out, see how long they take to execute, and check the execution plan. – The Impaler Jun 28 '23 at 16:12
Knowing your version could be useful, yes, @Greg. – Thom A Jun 28 '23 at 16:12
  • It is not functionally dependent if there is a single table `My_Measures` that happens to contain data in an unnormalised form anyway. In reality do these come from joining other tables? If so what is the definition of those? – Martin Smith Jun 28 '23 at 16:13
@MartinSmith Yes, I agree. The OP indicates that the column is functionally dependent, but as you say that would depend on the correct keys and view definition. – The Impaler Jun 28 '23 at 16:14
    Usually in SQL Server you would push the aggregation down to the underlying tables and join the aggregated result rather than aggregating after the join. This is especially important for things like SUM anyway so you don't end up summing the same values repeatedly when a row joins to many – Martin Smith Jun 28 '23 at 16:19
  • SQL Server is normally clever enough to realize there is a functional dependency and therefore not re-group the data. So adding the extra column to `GROUP BY` is fine. Furthermore, if you use an aggregate function it will also realize there can only be one result and not actually aggregate. – Charlieface Jun 28 '23 at 16:26
  • @Charlieface Unfortunately, the "functional dependency" I mention is really conceptual in nature. You and I can detect it by glancing at the data. However, the data is synced to my environment from _flat files_, without any schematic specifications or even (I believe) a formal `PRIMARY KEY`. So unless SQL Server performs its own analysis of the data, and detects dependencies automatically, it has nothing to work with. – Greg Jun 28 '23 at 16:29
    I'd go for your method 1 then. It will work for things like `bit` that don't support `min` and `max` and won't hide the problem if your assumptions about dependencies are not correct and in reality you have inconsistent names for an id – Martin Smith Jun 28 '23 at 16:32
    Then either option will still work, because either way it needs to sort the data if there is no existing index to provide sorted data. If there is only a key on `Person_ID` then stick to option 2. The results will be the same, and again the performance concern you are worried about is not a big deal. Primary perf concerns are things like re-sorting data, or hash joins etc. A `MAX` aggregate is not big in performance concerns. – Charlieface Jun 28 '23 at 16:36
I know this is an example, but fwiw the specific problem here is the result of a schema error, where the dependent `name` value _should not be stored in that table in the first place._ Instead, you would JOIN these results to the main Person table to find the name. This can be done separately from the aggregates, either via a CTE or traditional subquery. But the question in general is still valid, because there are times when the dependent data is legitimately already there. – Joel Coehoorn Jun 28 '23 at 16:40
  • @JoelCoehoorn The thought occurred to me. Unfortunately, all this data comes from _flat files_ (CSVs) synced to our system, so the data is fairly denormalized and (to my knowledge) has **absolutely no schematic structure defined**. I have to infer the dependencies from the column names and business knowledge. **But as memory becomes cheaper and performance is at a premium, I expect "queriers" will encounter 2NF quite routinely.** So this issue will likely continue to pop up, and a canonical solution will remain relevant. – Greg Jun 28 '23 at 16:53
  • @JoelCoehoorn _"Instead, you would JOIN these results to the main Person table to find the name. This can be done separately from the aggregates, either via a CTE or traditional subquery."_. I have included ([here](https://stackoverflow.com/posts/76574874/timeline#history_204a541f-be8a-44f3-b543-90eeea9410f4)) the `JOIN` approach as a third solution. This might be an accepted approach, but I initially excluded it because it felt so darn _disappointing_ to perform a whole `JOIN`, just to recover data that had already been _there_. – Greg Jun 28 '23 at 16:58
Everything except group by... Name or max(name) will be worse, so don't overthink this. Personally I'd go for group by name, because max(name) would have me thinking there's more than one name per ID. – siggemannen Jun 28 '23 at 18:39
  • @siggemannen _"max(name) would have me thinking there's more than one name per ID."_ Do you mean that the `MAX(Name)` language might confuse either **(1)** a newcomer or **(2)** the original author revisiting the query? Or do you mean _you_ are unsure? There is certainly only one `Name` per `Person_ID`, and the `MAX()` of `('Greg', 'Greg')` is just a trick to obtain `'Greg'`. – Greg Jun 28 '23 at 18:59
    Newcomers and original authors are often the same people: I often forget what I wrote a few weeks back, so I try to be clear always. – Charlieface Jun 28 '23 at 19:25
  • Exactly what @Charlieface wrote, both :) Especially since your data is unstructured, you want to keep things simple, introducing weird aggregates will make people worry they're missing something. An alternative is to write a comment but it might be overkill for this kind of stuff. – siggemannen Jun 28 '23 at 19:29
Based on YOUR examples, re-joining will not work as it will result in duplicates. (The same `Person_ID` appears more than once in the source table.) – Alex Jun 28 '23 at 23:23
  • @Alex Good catch! I have [updated](https://stackoverflow.com/posts/76574874/timeline#history_7ed25fed-487d-45b2-95a7-0584d22bc20d) my query to prevent duplicates. – Greg Jun 28 '23 at 23:48
  • SQL Server is ill suited for "find a few gems in a pile of poo" problems. As you have discovered most solutions boil down to "pick your poison". If you have to live with this and need to build a system on top (as opposed to a few ad-hoc queries), why not try to normalize the data? One way (there are others) is to build materialized views (similar to what you have done with your CTEs) and then use them to build your queries. Data updates would potentially be slow though. – Alex Jun 29 '23 at 00:44
  • @Alex _"...why not try to normalize the data?"_ I'm afraid I lack those permissions. The data is synced from an ERP, which exports flat CSV files in (I believe) 2nd normal form. These CSVs are saved to a dedicated server, from which my application then imports this data in regular syncs. I can't access the ERP schema, and the flat datasets themselves (along with the `VIEW`s) are merely sources for the app's highly restrictive GUI. This GUI links entities from these sources, to output tables in **exclusively** 1st normal form, which the app makes available for analysis and summary, also via GUI – Greg Jun 29 '23 at 15:22
  • @Alex _"One way (there are others) is to build materialized views...and then use them to build your queries."_ I believe my `VIEW`s are like "materialized views", as their data is stored for later use; they can be used by other of my `VIEW`s and be linked as entities to the 1NF tables output by the GUI. Since the only outputs _are_ such tables, then I suspect it hurts performance to `SELECT DISTINCT` from several of entities, to model a normalized database with `VIEW`s alone...only to have to `JOIN` them once again. _"Data updates would potentially be slow though."_ Welcome to my life! – Greg Jun 29 '23 at 15:31
[Materialized view definition](https://en.wikipedia.org/wiki/Materialized_view), to ensure that we are not talking apples and oranges. "These CSVs are saved to a dedicated server, from which my application then imports this data in regular syncs." - you can customise the import process to include data normalisation, can you not? If you cannot, then I was suggesting you try "materialized views". I don't think I can help you any further with this. Good luck! – Alex Jun 29 '23 at 21:59

0 Answers