SELECT DISTINCT + ORDER BY additional expression

Question

I have no experience with PostgreSQL and I am migrating a Rails5+MySQL application to Rails5+PostgreSQL and I am having a problem with a query.

I've already looked at some questions/answers and still haven't been able to solve my problem. My problem seems to be ridiculous, but I needed to ask for help here!

Query:

SELECT DISTINCT users.* FROM users 
INNER JOIN areas_users ON areas_users.user_id = users.id 
INNER JOIN areas ON areas.deleted_at IS NULL AND areas.id = areas_users.area_id 
WHERE users.deleted_at IS NULL AND users.company_id = 2 AND areas.id IN (2, 4, 5) 
ORDER BY CASE WHEN users.id=3 THEN 0 WHEN users.id=5 THEN 1 END, users.id, 1 ASC

Running the query in DBeaver, returns the error:

SQL Error [42P10]: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

What do I need to do to be able to use this SELECT DISTINCT with this ORDER BY CASE?

In a table called `users` I would expect all rows to have different ids. Why do you need DISTINCT? — forpas, Aug 09 '21 at 21:07

Erwin Brandstetter · Accepted Answer · 2021-08-10T21:39:29.510

It's like error message says:

for SELECT DISTINCT, ORDER BY expressions must appear in select list

This is an expression:

CASE WHEN users.id=3 THEN 0 WHEN users.id=5 THEN 1 END

You cannot order by it, while doing SELECT DISTINCT users.* FROM ... because that only allows ORDER BY expressions that appear in the SELECT list.

Typically, the best solution for DISTINCT is not to use it in the first place. If you don't duplicate rows, you don't have to de-duplicate them later. See:

How to speed up select distinct?

In your case, use an EXISTS semi-join (expression / subquery) instead of the joins. This avoids the duplication. Assuming distinct rows in table users, DISTINCT is out of job.

SELECT u.*
FROM   users u
WHERE  u.deleted_at IS NULL
AND    u.company_id = 2
AND    EXISTS (
   SELECT FROM areas_users au JOIN areas a ON a.id = au.area_id
   WHERE  au.user_id = u.id
   AND    a.id IN (2, 4, 5)
   AND    a.deleted_at IS NULL
   )
ORDER BY CASE u.id WHEN 3 THEN 0
                   WHEN 5 THEN 1 END, u.id, 1;  -- ①

Does what you request, and typically much faster, too.

Using simple ("switched") CASE syntax.

① There is still an ugly bit. Using a positional reference in ORDER BY can be convenient short syntax. But while you have SELECT *, it's a really bad idea. If the order of columns in the underlying table changes, your query is silently changed. Spell out the column in this use case!

(Typically, you don't need SELECT * in the first place, but just a selection of columns.)

IF your ID column is guaranteed to have positive numbers, this would be a bit faster:

...
ORDER BY CASE u.id WHEN 3 THEN -2
                   WHEN 5 THEN -1
                   ELSE u.id END, <name_of_first_column>

I MUST use `DISTINCT`

(Really?) If you insist:

SELECT DISTINCT CASE u.id WHEN 3 THEN -2 WHEN 5 THEN -1 ELSE u.id END AS order_column, u.*
FROM   users u
JOIN   areas_users au ON au.user_id = u.id
JOIN   areas a ON a.id = au.area_id
WHERE  u.deleted_at IS NULL
AND    u.company_id = 2
AND    a.id IN (2, 4, 5)
AND    a.deleted_at IS NULL
ORDER  BY 1, <name_of_previously_first_column>;  -- now, "ORDER BY 1" is ok

You get the additional column order_column in the result. You can wrap it in a subquery with a different SELECT ...

Just a proof of concept. Don't use this.

Or `DISTINCT ON`?

SELECT DISTINCT ON (CASE u.id WHEN 3 THEN -2 WHEN 5 THEN -1 ELSE u.id END, <name_of_first_column>)
       u.*
FROM   users u
JOIN   areas_users au ON au.user_id = u.id
JOIN   areas a ON a.id = au.area_id
WHERE  u.deleted_at IS NULL
AND    u.company_id = 2
AND    a.id IN (2, 4, 5)
AND    a.deleted_at IS NULL
ORDER  BY CASE u.id WHEN 3 THEN -2 WHEN 5 THEN -1 ELSE u.id END, <name_of_first_column>;

This works without returning an additional column. Still just proof of concept. Don't use it, the EXISTS query is much cheaper.

See:

Select first row in each GROUP BY group?

You write: `a larger query with some JOINs and does a different ordering with some CASEs, but to simplify the query:` Chances are, you removed the reason for your `DISTINCT` by over-simplifying. Or you do not need `DISTINCT` to begin with. Provide the original query, the (core) definition of `CREATE TABLE users ...` and your version of Postgres, in the question. If `DISTINCT` is actually meaningful, you need to define which row to pick from each set of duplicates. Then this should be solved swiftly. — Erwin Brandstetter, Aug 09 '21 at 21:48
Using the query I commented on in the question, the problem already happens, depending on the JOIN it can bring duplicate users, so I use DISTINCT. The original query is: ```SELECT DISTINCT "users".* FROM "users" INNER JOIN "areas_users" ON "areas_users"."user_id" = "users"."id" INNER JOIN "areas" ON "areas"."deleted_at" IS NULL AND "areas"."id" = "areas_users"."area_id" WHERE "users"."deleted_at" IS NULL AND "users"."company_id" = 2 AND "areas"."id" IN (2, 4, 5) ORDER BY CASE WHEN users.id=3 THEN 0 WHEN users.id=5 THEN 1 END, users.id, 1 ASC``` — Marcus, Aug 09 '21 at 21:58
I use this id = 3 and id = 5 because I need these records to come first — Marcus, Aug 09 '21 at 21:59
@Marcus: Please [edit] your question and add this query to it. I added a solution to my answer. — Erwin Brandstetter, Aug 09 '21 at 23:19
thx! Good tip of the negative number, I didn't know it's faster! Out of curiosity, in PostgreSQL there is no possibility to use DISTINCT in this query like MySQL? I ask because my API uses Rails `.distinct` by default — Marcus, Aug 10 '21 at 14:33
Sure, you can make your query work with `DISTINCT` if you insist. But that's a bad idea for this query. There are plenty of other good use cases. Or with the more versatile `DISTINCT ON`. See: https://stackoverflow.com/a/7630564/939860 Just like there are plenty of good use cases for a shotgun. But shooting yourself in the foot is not among them. If that "API" really uses `.distinct` by default, that's remarkably poor design. — Erwin Brandstetter, Aug 10 '21 at 16:10

SELECT DISTINCT + ORDER BY additional expression

1 Answers1

I MUST use DISTINCT

Or DISTINCT ON?

I MUST use `DISTINCT`

Or `DISTINCT ON`?