selecting DISTINCT rows that are ordered by something different

Question

i have this table performances, where there are performances of theatre plays from other table:

id | play_id |    when    | other uninteresting meta data
1  |    3    | 2020-04-01 |             ...
2  |    4    | 2020-03-03 |             
3  |    3    | 2020-01-02 |             
4  |    1    | 2020-06-03 |             
5  |    4    | 2020-10-13 |

and i would like to select the earliest performance for each distinct play (so each play is represented by a single, earliest, performance), ordered from soonest to latest.

so from the presented data i would like to get this:

id | play_id |    when    | other uninteresting meta data
3  |    3    | 2020-01-02 |             ...
2  |    4    | 2020-03-03 |             
4  |    1    | 2020-06-03 |

so far, after studying some answers here and here i came up with this query


SELECT * FROM
(
  SELECT DISTINCT ON (play_id) *
  FROM performances
  WHERE performances.deleted_at is null
  ORDER BY performances.play_id ASC, performances.when ASC
) distinct_plays
order by distinct_plays.when ASC

however, i don't trust it at all, because in the linked threads there were people bickering and telling that each other's answers are wrong; and while i see some problems with the answers in that thread i don't see a problem in this answer yet.

is this a good query for my task? does it not select duplicate rows or is very ineffective?

is this mysql or postgresql? please don't tag both without an explanation of why — ysth, Sep 05 '20 at 00:39
Your query does what you want. What happens when you run it? — GMB, Sep 05 '20 at 00:39
@ysth it is postgres but i assume mysql has the same syntax for this — user151496, Sep 05 '20 at 00:42
@GMB it returns a good set on my data, i was just unsure about unforseen scenarios (just like people pointed potential duplicate rows in the 2 previous threads) — user151496, Sep 05 '20 at 00:42
ok, i have removed the mysql query. however, i would still like to know about a more generic approach, as i will be implementing this into laravel framework's `Eloquent` model which is trying to use database agnostic queries — user151496, Sep 05 '20 at 00:44
I was among the "bickering people" at the [referenced question](https://stackoverflow.com/q/9795660/939860), because the accepted answer had been incorrect. It got fixed after that. Let me assure you: your current query is 100% correct. Detailed explanation for `DISTINCT ON` [here](https://stackoverflow.com/a/7630564/939860). Depending on data distribution, there may be faster solutions. If performance is important, provide details as instructed [here](https://stackoverflow.com/tags/postgresql-performance/info). Maybe in a new question. — Erwin Brandstetter, Sep 05 '20 at 03:15
@user151496 . . . If you have a question about Postgres, listen to Erwin Brandstetter. — Gordon Linoff, Sep 05 '20 at 12:17

score 2 · Accepted Answer · edited Sep 05 '20 at 15:05

Your query does what you want. distinct on is usually the right tool in Postgres to solve such greatest-n-per-group problem... alas, it gives no flexibility on the order of rows in the resultset.

It appears that you want a different sort in the result as in distinct on - so you need another level of nesting for this. While your code does what you want, I will recommend using row_number() instead (which has also the advantage of being supported in many databases, as opposed to vendor-specific distinct on):

SELECT *
FROM (
    SELECT p.*, ROW_NUMBER() OVER(PARTITION BY play_id ORDER BY p.when asc) rn
    FROM performances p
    WHERE p.deleted_at is null
) p
WHERE rn = 1
ORDER BY p.when asc

You might also want to try a correlated subquery:

SELECT p.*
FROM performances p
WHERE p.deleted_at IS NULL AND p.when = (
    SELECT MIN(p1.when) FROM performances p1 WHERE p1.play_id = p.play_id
)
ORDER BY p.when

For performance with the correlated subquery, consider an index on (play_id, when).

thank you! i will use your first query, as i would like make my applicatoin database independent. however, i have fixed the order to be ascending to give it the results that i want. but your second query is wrong, as it would return unwanted rows with the same `play_ids` in case they have the same `when` value — user151496, Sep 05 '20 at 14:03

Andronicus · Answer 2 · 2020-09-05T00:43:55.760

1

You can use first_value for that:

select first_value(id) over(w), play_id, first_value(when) over(w) -- the rest of the columns analogously
from performances
group by play_id
window w as (partition by play_id order by when)

edited Sep 05 '20 at 00:43

answered Sep 05 '20 at 00:40

Andronicus

25,419
17
47
88

1

that doesn't get the "other uninteresting meta data" columns, which I believe they want (and want to be from the row with the least when) – ysth Sep 05 '20 at 00:42
@ysth `first_value` needs to be used on every column analogously, thanks for the comment – Andronicus Sep 05 '20 at 00:45

selecting DISTINCT rows that are ordered by something different

2 Answers2