In one to many relationship, return distinct rows based on MIN value

Question

Let's say a patient makes many visits. I want to write a query that returns distinct patient rows based on their earliest visit. For example, consider the following rows.

patients
-------------
id    name
1     Bob
2     Jim
3     Mary

visits
-------------
id    patient_id    visit_date    reference_number
1     1             6/29/14       09f3be26
2     1             7/8/14        34c23a9e
3     2             7/10/14       448dd90a

What I want to see returned by the query is:

id    name    first_visit_date    reference_number
1     Bob     6/29/14             09f3be26
2     Jim     7/10/14             448dd90a

What I've tried looks something like:

SELECT
  patients.id, 
  patients.name, 
  visits.visit_date AS first_visit_date, 
  visits.reference_number
FROM
  patients
INNER JOIN (
  SELECT
    *
  FROM
    visits
  ORDER BY
    visit_date
  LIMIT
    1
) visits ON
  visits.patient_id = patients.id

Adding the LIMIT causes the query to return 0 rows, but removing it causes the query to return duplicates. What's the trick here? I've also tried selecting MIN(visit_date) in the INNER JOIN but that's also returning dups.

Update

It's been suggested that this question is a duplicate, but to me it seems different because I'm doing this over two separate tables. The accepted answer on the other question suggests joining on y.max_total = x.total, which works if the table being joined is the same table being selected from. Additionally, I need to return other columns from the row with the MIN date, not just the date itself.

The answer I've accepted works great, however.

http://stackoverflow.com/questions/25531454/using-aliases-in-sql-inner-join/25531637#25531637 — Bulat, Aug 28 '14 at 14:44

wildplasser · Answer 1 · 2014-08-28T20:02:27.283

Avoiding the DISTINCT ON(p.id), but using a plain old NOT EXISTS(...) instead

SELECT p.id, p.name
     , v.first_visit_date, v.reference_number
FROM patients p
JOIN visits v ON p.id = v.patient_id
    -- exclude all join-products that are not the first for a patient.
WHERE NOT EXISTS (
   SELECT *
   FROM visits nx
   WHERE nx.patient_id = v.patient_id
   AND ( nx.visit_date < v.visit_date
       OR (nx.visit_date = v.visit_date AND nx.id < v.id) -- tie-breaker condition
       )
   );

score 1 · Accepted Answer · answered Aug 28 '14 at 14:35

1

Use distinct on

select distinct on (p.id)
    p.id, 
    p.name, 
    v.visit_date as first_visit_date, 
    v.reference_number
from
    patients p
    inner join
    visits v on p.id = v.patient_id
order by p.id, v.visit_date

http://www.postgresql.org/docs/current/static/sql-select.html#SQL-DISTINCT

answered Aug 28 '14 at 14:35

Clodoaldo Neto

118,695
26
233
260

I chose postgresql for my example, but I've become aware that ```distinct on``` isn't available in other dialects. Can you show how this might be done in Oracle, for example? – Samo Aug 28 '14 at 18:41
Or I can post a different question and link to this one for reference if you prefer. – Samo Aug 28 '14 at 18:42
@Samo Yes that is a totally different answer. So a different question please. – Clodoaldo Neto Aug 28 '14 at 18:44
http://stackoverflow.com/questions/25556749/oracle-with-a-one-to-many-relationship-select-distinct-rows-based-on-a-min-va – Samo Aug 28 '14 at 19:40

score 0 · Answer 3 · answered Aug 28 '14 at 14:43

0

You want an aggregate query to reduce multiple rows of the visits table to one row per patient ID. LIMIT is a bit messy because it's not strictly relational, though you could probably make it work if you were sufficiently determined. If the query optimizer is decent enough then there should be no performance difference between a version using LIMIT (appropriately) and this:

SELECT
  patients.id, 
  patients.name, 
  visits.first_visit_date AS first_visit_date, 
  visits.reference_number
FROM
  patients
INNER JOIN (
  SELECT
    patient_id,
    MIN(visit_date) as first_visit_date
  FROM
    visits
  GROUP BY
    patient_id
) visits ON
  visits.patient_id = patients.id

I prefer the version with MIN() because it is clearer to me. Also, if it's clearer to me, then it stands a better chance of being clearer to the query optimizer, too.

answered Aug 28 '14 at 14:43

John Bollinger

160,171
8
81
157

I prefer that to `distinct on` as well, even though it's longer, because the `distinct on` solution assumes that the first row in the specified order will be the one retained. Maybe that works, but I can't be sure without reading the docs, and that's a maintenance problem. YMMV. – John Bollinger Aug 28 '14 at 14:48
A couple problems with this. First, ```visits.reference``` number doesn't exist in your query because you didn't select it in your ```INNER JOIN```. Secondly, when I add that in, I still get duplicate rows. – Samo Aug 28 '14 at 14:51
2

_Maybe that works, but I can't be sure without reading the docs_ Yes read the docs to be sure it works before saying _maybe_. The second point is valid. For someone coming from other DBMS it is confusing. For a Postgresql user it is very clear. – Clodoaldo Neto Aug 28 '14 at 14:59
@JohnBollinger: the `distinct on` will be much faster than a self join with an aggregate. Using window functions would be faster than the self-join as well – Aug 28 '14 at 15:56

score 0 · Answer 4 · answered Aug 28 '14 at 15:23

Although you are using PostgreSQL, just in case it might be useful or "inspiring", here's the T-SQL version.

SELECT p.id, name, first_visit.visit_date as first_visit_date, v.reference_number as first_visit_reference_number
FROM patients p
INNER JOIN 
(
    SELECT patient_id, MIN(visit_date) AS visit_date
    FROM visits
    GROUP BY patient_id
) first_visit ON first_visit.patient_id = p.id
INNER JOIN visits v ON v.patient_id = p.id AND v.visit_date = first_visit.visit_date

In one to many relationship, return distinct rows based on MIN value

4 Answers4

Linked