84

ActionView::Template::Error (PG::Error: ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list

I'm creating an events website and I'm trying to sort the rendered rsvps by the start time of the event. There are a lot of RSVPS so I'm grouping them with distinct, but I've been having a lot of difficulty over the last few days on sorting the results without this error popping up on PG. I've looked at some of the previous questions on the topic and am still pretty lost. How can I get this to work? Thank you so much!

@rsvps = Rsvp.where(:voter_id => current_user.following.collect {|f| f["id"]}, :status => 'going').where("start_time > ? AND start_time < ?", Time.now, Time.now + 1.month).order("count_all desc").count(:group => :event_id).collect { |f| f[0] }

<%= render :partial => 'rsvps/rsvp', :collection => Rsvp.where(:event_id => @rsvps).select("DISTINCT(event_id)").order('start_time asc') %>
Tomasz Jakub Rup
  • 10,502
  • 7
  • 48
  • 49
Andrew
  • 1,167
  • 1
  • 10
  • 16

6 Answers6

120

I know this is a rather old question, but I just went through a small example in my head which helped me understand why Postgres has this seemingly odd restriction on SELECT DISTINCT / ORDER BY columns.

Imagine you have the following data in your Rsvp table:

 event_id |        start_time
----------+------------------------
    0     | Mar 17, 2013  12:00:00
    1     |  Jan 1, 1970  00:00:00
    1     | Aug 21, 2013  16:30:00
    2     |  Jun 9, 2012  08:45:00

Now you want to grab a list of distinct event_ids, ordered by their respective start_times. But where should 1 go? Should it come first, because the one tuple starts on Jan 1, 1970, or should it go last because of the Aug 21, 2013?

As the database system can't make that decision for you and the syntax of the query can't depend on the actual data it might be operating on (assuming event_id is unique), we are restricted to ordering only by columns from the SELECT clause.

As for the actual question - an alternative to Matthew's answer is using an aggregate function like MIN or MAX for the sorting:

  SELECT event_id
    FROM Rsvp
GROUP BY event_id
ORDER BY MIN(start_time)

The explicit grouping and aggregation on start_time permit the database to come up with a unambiguous ordering of the result tuples. Note however, that readability is definitely an issue in this case ;)

AdrianoKF
  • 2,891
  • 1
  • 25
  • 31
  • 2
    Moreover the ordering by MIN or MAX does not work. It just gives the same error. – Jagger Sep 13 '17 at 07:00
  • @Jagger, you might be referring to an older version of MySQL - since version 5.7.5 MySQL enables ONLY_FULL_GROUP_BY by default, which disallows this exact scenario and produces an error message (`Expression #1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column 'test.rsvp.start_time' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by`) This change is documented in the [MySQL 5.7 changelog](https://dev.mysql.com/doc/refman/5.7/en/sql-mode.html#sqlmode_only_full_group_by) – AdrianoKF Sep 13 '17 at 12:59
  • @AdrianoKF Strange, because I am using `mysql-5.7.17-winx64`. – Jagger Sep 13 '17 at 13:03
  • Weird, I just tried to reproduce your behaviour and got the above error message - did you maybe add a PK/unique constraint on the event_id column? We can continue this discussion in chat, if you'd like to! – AdrianoKF Sep 13 '17 at 13:05
  • 2
    @AdrianoKF I have just tested that with the newest version of MySQL `mysql-8.0.2-dmr-winx64` (development one). There is indeed an error now in MySQL like in PostgreSQL. Anyway, your query with MIN and MAX as workaround does not work either in PostgreSQL. – Jagger Sep 13 '17 at 20:10
  • it works but I needed to add "asc" at the end. At least with postgresql & H2 – Enrico Giurin Sep 11 '20 at 02:21
93

The ORDER BY clause can only be applied after the DISTINCT has been applied. Since only the fields in the SELECT statement are taken into consideration for the DISTINCT operations, those are the only fields may be used in the ORDER BY.

Logically, if you just want a distinct list of event_id values, what order they occur in should be irrelevant. If order does matter, then you should add the start_time to the SELECT list so that there is context for the order.

Also, these two SELECT clauses are NOT equivalent, so be careful:

SELECT DISTINCT(event_id, start_time) FROM ...

SELECT DISTINCT event_id, start_time FROM ...

The second is the form you want. The first will return a series of records with the data represented as a ROW construct (a single column with a tuple inside). The second will return normal columns of data output. It only works as expected in the single-column case where the ROW construct is reduced down since it is only a single column.

Matthew Wood
  • 16,017
  • 5
  • 46
  • 35
  • 6
    wow you are awesome! great explanation and you just added a few days to my life :D – Andrew Oct 02 '12 at 15:40
  • 5
    "Logically, if you just want a distinct list of event_id values, what order they occur in should be irrelevant." - I've seen an app, in which there's order by and limit, so let's say only 10 first and distinct items (in this case events) shall be selected. – Piohen Jul 02 '14 at 10:54
  • I had the same problem in Hibernate. The cause was that I wrote `... ORDER BY time` instead of `... ORDER BY obj.time`. – Aron Lorincz Aug 17 '15 at 08:49
  • "The ORDER BY clause can only be applied after the DISTINCT has been applied.". Funny this is not of any problem for MySQL. – Jagger Sep 13 '17 at 06:45
  • Hi @Matthew Wood, I have a similar issue here, but relating to jsonb ordering: https://stackoverflow.com/questions/54305695/postgres-distinct-order-by-a-jsonb-key-throws-a-rails-pginvalidcolumnreferen – Martin Verdejo Jan 22 '19 at 10:27
  • "The ORDER BY clause can only be applied after the DISTINCT has been applied."; "can only be applied" is probably wrong here, it should be "is applied". In theory it should be possible to first order, then aggregate distinct. Though the DB probably does not do this, maybe for performance reasons because ordering can be expensive? – Marcono1234 Jun 29 '20 at 14:12
7

Syntactic vs logical order of operations

I think that the confusion around the relationship between DISTINCT and ORDER BY (or also GROUP BY, for that matter), can only really be understood if the logical order of operations in SQL is understood. It is different from the syntactic order of operations, which is the primary source of confusion.

In this example, it looks as though DISTINCT is related to SELECT, given its syntactic closeness, but it's really an operator that is applied after SELECT (the projection). Due to the nature of what DISTINCT does (remove duplicate rows), all the not-projected contents of a row are no longer available after the DISTINCT operation, which includes the ORDER BY clause. According to the logical order of operations (simplified):

  • FROM (produces all possible column references)
  • WHERE (can use all column references from FROM)
  • SELECT (can use all column references from FROM, and create new expressions, and alias them)
  • DISTINCT (operates on the tuple projected by SELECT)
  • ORDER BY (depending on the presence of DISTINCT, can operate on the tuple projected by SELECT, and if DISTINCT is absent *perhaps (depending on the dialect) also on other expressions)

What about DISTINCT and ORDER BY

The fact that, without DISTINCT, ORDER BY can access (in some dialects) also things that haven't been projected may be a bit weird, certainly useful. E.g. this works:

WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (2, 'C', 'A'),
         (3, 'B', 'B')
)
SELECT id
FROM emp
ORDER BY fname DESC

dbfiddle here. Producing

id
--
2
3
1

This changes when you add DISTINCT. This no longer works:

WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (2, 'C', 'A'),
         (3, 'B', 'B')
)
SELECT DISTINCT name
FROM emp
ORDER BY fname DESC

dbfiddle here. The error being:

ERROR: for SELECT DISTINCT, ORDER BY expressions must appear in select list LINE 8: ORDER BY fname DESC

Because what fname value would you attribute to name = A? A or C? The answer would decide whether you'd be getting A, B as a result or B, A. It cannot be decided.

PostgreSQL DISTINCT ON

Now, as mentioned in the above linked article, PostgreSQL supports an exception to this, which can occasionally be useful: DISTINCT ON (see also questions like these):

WITH emp (id, fname, name) AS (
  VALUES (1, 'A', 'A'),
         (2, 'C', 'A'),
         (3, 'B', 'B')
)
SELECT DISTINCT ON (name) id, fname, name
FROM emp
ORDER BY name, fname, id

dbfiddle here, producing:

id |fname|name
---|-----|----
1  |A    |A   
3  |B    |B   

This query allows to produce only distinct values of name, and then per duplicate row, take the first one given the ORDER BY clause, which makes the choice per distinct group unambiguous. This can be emulated in other RDBMS using window functions.

Lukas Eder
  • 211,314
  • 129
  • 689
  • 1,509
2

Because, you are using start_time column, you can use row_number() which is one of Window Functions of PostgreSQL and stack it in

  • order of start_time, if you are expecting row values with first start_time

    Select event_id from (SELECT event_id ,ROW_NUMBER() OVER(PARTITION BY event_id ORDER BY start_time) AS first_row FROM Rsvp) where first_row = 1

  • reverse order of start_time, if you are expecting row values with last start_time

    Select event_id from (SELECT event_id ,ROW_NUMBER() OVER(PARTITION BY event_id ORDER BY start_time desc) AS last_row FROM Rsvp) where last_row = 1

You can also use different Window Function as per your requirements.

SUKUMAR S
  • 187
  • 2
  • 8
0

this GitHub answer gave me the insight that I was not able to find in the other answsers posted here.

Bottom line, you need to have the same logic as the ORDER BY statement in the SELECT statement.

In the example given they are using a CASE statement to make a custom order.

However, in the example they have the CASE statement twice and that is not necessary, the SQL can be simplified to:

  SELECT DISTINCT pet.name, pet.alive, CASE WHEN pet.alive THEN 1 ELSE 0 
  END AS custom_order
    FROM (values ('a', true), ('b', true), ('c', false)) AS pet(name, alive)
   ORDER BY custom_order;

Assuming this was a model in Rails the example can be written as:

  Pet.select(:name, :alive, Arel.sql("CASE WHEN alive THEN 1 ELSE 0 
  END AS custom_order")).order(:custom_order)

If you don't need the custom name on the sort order, ActiveRecord will name it case which can be used to sort as well

  Pet.select(:name, :alive, Arel.sql("CASE WHEN alive THEN 1 ELSE 0 
  END")).order(:case)
SMAG
  • 652
  • 6
  • 12
-1

I was able to solve this error by adding a column in my select and then using ORDER BY on that column. I had SELECT DISTINCT concat(dl.FirstName, concat(' ', dl.LastName)) as displayName, ... and I wanted to ORDER BY last name (as one does). I tried every permutation I could think of for adding the ORDER BY within the SELECT, but what worked was just adding , dl.LastName as lastName then adding ORDER BY lastName ASC at the end of the query as in a more typical query.

I ended up with an additional column, but it's easy to ignore in my application.

matsad
  • 318
  • 7
  • 12