2

Let's say I have the following PostgreSQL table called products:

CREATE TABLE IF NOT EXISTS mytable (
    id serial NOT NULL PRIMARY KEY,
    label VARCHAR(50) NOT NULL,
    info jsonb NOT NULL,
    created_at timestamp NOT NULL DEFAULT now()
);

Here is some test data. Note my actual table has millions of records.

INSERT INTO products (label, info) VALUES ('a', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('a', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('c', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('c', '[1, 2, 3]');
INSERT INTO products (label, info) VALUES ('b', '[1, 2, 3]');

I want to write a query that grabs distinct labels and orders the records by the created_at field. My first instinct would be to write the following query:

SELECT DISTINCT ON (label) * FROM products ORDER BY created_at DESC;

However, this fails with the following error:

ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions

Looks like I can fix this using SQL subqueries:

SELECT * FROM (
    SELECT DISTINCT ON (label) * FROM products
) AS subquery ORDER BY created_at DESC;

which generates the following expected result:

 id | label |   info    |         created_at
----+-------+-----------+----------------------------
  5 | b     | [1, 2, 3] | 2022-11-14 03:32:23.245669
  3 | c     | [1, 2, 3] | 2022-11-14 03:32:23.242813
  1 | a     | [1, 2, 3] | 2022-11-14 03:32:23.239791

Is this the best way to fix this issue? Or is there a faster way to query this data? Note, I mention above how my actual table has millions of records so I want to come up with the best query possible.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
Johnny Metz
  • 5,977
  • 18
  • 82
  • 146
  • Your last query selects an arbitrary row for each label, then sorts that arbitrary selection by created_at. If that is what you *want* to do, then that query is probably the best way to do it. – jjanes Nov 14 '22 at 13:51

1 Answers1

0

Leading ORDER BY expressions must match DISTINCT ON expressions (and vice versa):

SELECT DISTINCT ON (label) * FROM products ORDER BY label, created_at DESC;

See:

It's unclear whether you want created_at ASC or created_at DESC: you show the latter, but the results you call "expected" match the former - because your "fix" does not do what you seem to think it's doing.

To sort resulting (distinct) rows by created_at DESC, you have to run an outer SELECT with a different sort order:

SELECT *
FROM  (
   SELECT DISTINCT ON (label) *
   FROM   products
   ORDER  BY label, created_at DESC
   ) sub
ORDER  BY created_at DESC;

See:

For a big table, be sure to have an index on (label, created_at) or (label, created_at DESC), respectively.

my actual table has millions of records so I want to come up with the best query possible.

There may be (much) faster solutions depending on undisclosed details. Most importantly:

  • How many distinct "labels" for how many rows in total?
  • Is there a separate table with one row per (relevant) distinct label?
  • Do you actually need SELECT *, or is SELECT label, created_at all you need?
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228