1

i have this table performances, where there are performances of theatre plays from other table:

id | play_id |    when    | other uninteresting meta data
1  |    3    | 2020-04-01 |             ...
2  |    4    | 2020-03-03 |             
3  |    3    | 2020-01-02 |             
4  |    1    | 2020-06-03 |             
5  |    4    | 2020-10-13 |             

and i would like to select the earliest performance for each distinct play (so each play is represented by a single, earliest, performance), ordered from soonest to latest.

so from the presented data i would like to get this:

id | play_id |    when    | other uninteresting meta data
3  |    3    | 2020-01-02 |             ...
2  |    4    | 2020-03-03 |             
4  |    1    | 2020-06-03 |             

so far, after studying some answers here and here i came up with this query


SELECT * FROM
(
  SELECT DISTINCT ON (play_id) *
  FROM performances
  WHERE performances.deleted_at is null
  ORDER BY performances.play_id ASC, performances.when ASC
) distinct_plays
order by distinct_plays.when ASC

however, i don't trust it at all, because in the linked threads there were people bickering and telling that each other's answers are wrong; and while i see some problems with the answers in that thread i don't see a problem in this answer yet.

is this a good query for my task? does it not select duplicate rows or is very ineffective?

GMB
  • 216,147
  • 25
  • 84
  • 135
user151496
  • 1,849
  • 24
  • 38
  • is this mysql or postgresql? please don't tag both without an explanation of why – ysth Sep 05 '20 at 00:39
  • Your query does what you want. What happens when you run it? – GMB Sep 05 '20 at 00:39
  • @ysth it is postgres but i assume mysql has the same syntax for this – user151496 Sep 05 '20 at 00:42
  • 1
    @GMB it returns a good set on my data, i was just unsure about unforseen scenarios (just like people pointed potential duplicate rows in the 2 previous threads) – user151496 Sep 05 '20 at 00:42
  • indeed not, mysql does not have "DISTINCT ON" – ysth Sep 05 '20 at 00:43
  • ok, i have removed the mysql query. however, i would still like to know about a more generic approach, as i will be implementing this into laravel framework's `Eloquent` model which is trying to use database agnostic queries – user151496 Sep 05 '20 at 00:44
  • 1
    I was among the "bickering people" at the [referenced question](https://stackoverflow.com/q/9795660/939860), because the accepted answer had been incorrect. It got fixed after that. Let me assure you: your current query is 100% correct. Detailed explanation for `DISTINCT ON` [here](https://stackoverflow.com/a/7630564/939860). Depending on data distribution, there may be faster solutions. If performance is important, provide details as instructed [here](https://stackoverflow.com/tags/postgresql-performance/info). Maybe in a new question. – Erwin Brandstetter Sep 05 '20 at 03:15
  • @user151496 . . . If you have a question about Postgres, listen to Erwin Brandstetter. – Gordon Linoff Sep 05 '20 at 12:17

2 Answers2

2

Your query does what you want. distinct on is usually the right tool in Postgres to solve such greatest-n-per-group problem... alas, it gives no flexibility on the order of rows in the resultset.

It appears that you want a different sort in the result as in distinct on - so you need another level of nesting for this. While your code does what you want, I will recommend using row_number() instead (which has also the advantage of being supported in many databases, as opposed to vendor-specific distinct on):

SELECT *
FROM (
    SELECT p.*, ROW_NUMBER() OVER(PARTITION BY play_id ORDER BY p.when asc) rn
    FROM performances p
    WHERE p.deleted_at is null
) p
WHERE rn = 1
ORDER BY p.when asc

You might also want to try a correlated subquery:

SELECT p.*
FROM performances p
WHERE p.deleted_at IS NULL AND p.when = (
    SELECT MIN(p1.when) FROM performances p1 WHERE p1.play_id = p.play_id
)
ORDER BY p.when

For performance with the correlated subquery, consider an index on (play_id, when).

user151496
  • 1,849
  • 24
  • 38
GMB
  • 216,147
  • 25
  • 84
  • 135
  • thank you! i will use your first query, as i would like make my applicatoin database independent. however, i have fixed the order to be ascending to give it the results that i want. but your second query is wrong, as it would return unwanted rows with the same `play_ids` in case they have the same `when` value – user151496 Sep 05 '20 at 14:03
1

You can use first_value for that:

select first_value(id) over(w), play_id, first_value(when) over(w) -- the rest of the columns analogously
from performances
group by play_id
window w as (partition by play_id order by when)
Andronicus
  • 25,419
  • 17
  • 47
  • 88
  • 1
    that doesn't get the "other uninteresting meta data" columns, which I believe they want (and want to be from the row with the least when) – ysth Sep 05 '20 at 00:42
  • @ysth `first_value` needs to be used on every column analogously, thanks for the comment – Andronicus Sep 05 '20 at 00:45