Why can't I exclude dependent columns from `GROUP BY` when I aggregate by a key?

Question

If I have the following tables (as an example using PostgreSQL, but could be any other relational database), where car has two keys (id and vin):

create table car (
  id int primary key not null,
  color varchar(10),
  brand varchar(10),
  vin char(17) unique not null
);

create table appraisal (
  id int primary key not null,
  recorded date not null,
  car_id int references car (id),
  car_vin char(17) references car (vin),
  price int
);

I can successfully include c.color and c.brand in the select list without aggregating them, since they depend on c.id:

select 
  c.id, c.color, c.brand,
  min(price) as min_appraisal,
  max(price) as max_appraisal
from car c
left join appraisal a on a.car_id = c.id
group by c.id; -- c.color, c.brand are not needed here

However, the following query fails since it doesn't allow me to include c.color and c.brand in the select list, even though it does depend on c.vin (that is a key) of the table.

select 
  c.vin, c.color, c.brand,
  min(price) as min_appraisal,
  max(price) as max_appraisal
from car c
left join appraisal a on a.car_vin = c.vin
group by c.vin; -- Why are c.color, c.brand needed here?

Error: ERROR: column "c.color" must appear in the GROUP BY clause or be used in an aggregate function Position: 18

Example in DB Fiddle.

I think the claim that this "could be any other relational database" weakens the question somewhat, because this appears to be a specific implementation decision in PostgreSQL. — IMSoP, May 30 '20 at 14:31

Erwin Brandstetter · Accepted Answer · 2020-05-30T13:37:35.380

Because only the PK covers all columns of an underlying table in the GROUP BY clause. Hence your first query works. A UNIQUE constraint does not.

The combination of a non-deferrable UNIQUE and a NOT NULL constraint would also qualify. But that's not implemented - as well as some other functional dependencies known to the SQL standard. Peter Eisentraut, the principal author of the feature, had more in mind, but it was determined at the time that the demand is low and associated costs might be high. See the discussion about the feature on pgsql-hackers.

The manual:

When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.

And more explicitly:

PostgreSQL recognizes functional dependency (allowing columns to be omitted from GROUP BY) only when a table's primary key is included in the GROUP BY list. The SQL standard specifies additional conditions that should be recognized.

Since c.vin is UNIQUE NOT NULL, you can fix your second query by using the PK column instead:

...
group by c.id;

Aside, while referential integrity is enforced and the whole table is queried, both of the given queries can be substantially cheaper: aggregate rows in appraisal before the join. This removes the need to GROUP BY in the outer SELECT a priori. Like:

SELECT c.vin, c.color, c.brand
     , a.min_appraisal
     , a.max_appraisal
FROM   car c
LEFT   JOIN (
   SELECT car_vin
        , min(price) AS min_appraisal
        , max(price) AS max_appraisal
   FROM   appraisal
   GROUP  BY car_vin
   ) a ON a.car_vin = c.vin;

See:

Multiple array_agg() calls in a single query

@philipxy Do you know of any section of the SQL Standard that could address FDs? I was expecting UNIQUE constraint to behave like PKs... In fact I have a lot of trouble finding any difference between them now. — The Impaler, May 30 '20 at 03:01
@TheImpaler It's in the standard draft available at the Wikipedia SQL article, including in the list/table at the end of optional functionality packages. (Part 1 and/or 2.) The package specifies that certain inferencing is done. (A DBMS could do more.) I'm not sure what you're trying to say about UNIQUE; as a constraint PK just means UNIQUE NOT NULL; but beware SQL Server & MySQL (in different ways) & maybe other DBMSs don't implement UNIQUE under NULLs per the standard. — philipxy, May 30 '20 at 03:05
@TheImpaler 4.19 Functional Dependencies; Annex F (informative) SQL feature taxonomy Table 39 — Feature taxonomy for optional features 270 T301 Functional dependencies; see the index for more mention of FDs. http://www.wiscorp.com/sql20nn.zip 7IWD2-02-Foundation-2011-12.pdf — philipxy, May 30 '20 at 03:24
@ErwinBrandstetter Your reference to the authors in 2010 is really great. That puts my mind to rest on this. It's not that it's not possible or it doesn't make sense; it's just because it was an implementation decision, a cost/benefit decision at the time. Thanks. — The Impaler, May 30 '20 at 15:26
What I meant about "Postgres does not fully implement the SQL standard here" was that it would have been clearer to say "Postgres does not fully implement the SQL standard relevant optional functionality here". The current version still doesn't make clear that the "should be recognized" leaves out "if one claims conformance to the optional functionality". Which Postgres does not. ["T301 | Functional dependencies | partially supported"](https://www.postgresql.org/docs/12/unsupported-features-sql-standard.html) cc @TheImpaler — philipxy, May 31 '20 at 02:28
@philipxy: OK, I see the fine point - but it seems rather academic. The word "should" in the quote indicates the optional nature, but doesn't clarify exactly. You might bring this up with pgsql-docs@lists.postgresql.org if you think it's worth clarifying ... — Erwin Brandstetter, May 31 '20 at 02:44

Why can't I exclude dependent columns from `GROUP BY` when I aggregate by a key?

1 Answers1

Linked