PostgreSQL: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Question

Let's say I have the following PostgreSQL table called products:

CREATE TABLE IF NOT EXISTS mytable (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

Here is some test data. Note my actual table has millions of records.

INSERT INTO products (label, info) VALUES ('a', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('a', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('c', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('c', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('b', '[1, 2, 3]');

I want to write a query that grabs distinct labels and orders the records by the created_at field. My first instinct would be to write the following query:

SELECT DISTINCT ON (label) * FROM products ORDER BY created_at DESC;

However, this fails with the following error:

ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Looks like I can fix this using SQL subqueries:

SELECT * FROM (
    SELECT DISTINCT ON (label) * FROM products
) AS subquery ORDER BY created_at DESC;

which generates the following expected result:

 id | label |   info    |         created_at
----+-------+-----------+----------------------------
  5 | b     | [1, 2, 3] | 2022-11-14 03:32:23.245669
  3 | c     | [1, 2, 3] | 2022-11-14 03:32:23.242813
  1 | a     | [1, 2, 3] | 2022-11-14 03:32:23.239791

Is this the best way to fix this issue? Or is there a faster way to query this data? Note, I mention above how my actual table has millions of records so I want to come up with the best query possible.

Your last query selects an arbitrary row for each label, then sorts that arbitrary selection by created_at. If that is what you *want* to do, then that query is probably the best way to do it. — jjanes, Nov 14 '22 at 13:51

Erwin Brandstetter · Answer 1 · 2022-11-14T04:35:10.280

Leading ORDER BY expressions must match DISTINCT ON expressions (and vice versa):

SELECT DISTINCT ON (label) * FROM products ORDER BY label, created_at DESC;

See:

Select first row in each GROUP BY group?

It's unclear whether you want created_at ASC or created_at DESC: you show the latter, but the results you call "expected" match the former - because your "fix" does not do what you seem to think it's doing.

To sort resulting (distinct) rows by created_at DESC, you have to run an outer SELECT with a different sort order:

SELECT *
FROM  (
   SELECT DISTINCT ON (label) *
   FROM   products
   ORDER  BY label, created_at DESC
   ) sub
ORDER  BY created_at DESC;

See:

PostgreSQL DISTINCT ON with different ORDER BY

For a big table, be sure to have an index on (label, created_at) or (label, created_at DESC), respectively.

my actual table has millions of records so I want to come up with the best query possible.

There may be (much) faster solutions depending on undisclosed details. Most importantly:

How many distinct "labels" for how many rows in total?
Is there a separate table with one row per (relevant) distinct label?
Do you actually need SELECT *, or is SELECT label, created_at all you need?

The query you proposed generates the wrong result. It's not ordered by `created_at` in descending order. — Johnny Metz, Nov 14 '22 at 04:22
Consider the additions. Careful not to confound two distinct questions: which row to pick per label, and how to sort resulting (distinct) rows. — Erwin Brandstetter, Nov 14 '22 at 04:27

PostgreSQL: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

1 Answers1

Linked