0

I have a log table in MySQL (5.7.14) with the following schema:

CREATE TABLE logs
(
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  entry_date DATE NOT NULL,
  original_date DATE NOT NULL,
  ref_no VARCHAR(30) NOT NULL
) Engine=InnoDB;

INSERT INTO logs VALUES
(1,'2020-01-01','2020-01-01','XYZ'),
(2,'2020-01-01','2020-01-01','ABC'),
(3,'2020-01-02','2020-01-01','XYZ'),
(4,'2020-01-02','2020-01-01','ABC'),
(5,'2020-01-03','2020-01-02','XYZ'),
(6,'2020-01-03','2020-01-01','ABC');

I want to return the first row for each unique (original_date, ref_no) pairing, where 'first' is defined as 'lowest id'.

For example, if I had the following data:

id|entry_date|original_date|ref_no
--+----------+-------------+------
1 |2020-01-01|2020-01-01   |XYZ
2 |2020-01-01|2020-01-01   |ABC
3 |2020-01-02|2020-01-01   |XYZ
4 |2020-01-02|2020-01-01   |ABC
5 |2020-01-03|2020-01-02   |XYZ
6 |2020-01-03|2020-01-01   |ABC

I would want the query to return:

id|entry_date|original_date|ref_no
--+----------+-------------+------
1 |2020-01-01|2020-01-01   |XYZ
2 |2020-01-01|2020-01-01   |ABC
5 |2020-01-03|2020-01-02   |XYZ

In other words:

  • Row 1 is returned because we haven't seen 2020-01-01,XYZ before.
  • Row 2 is returned because we haven't seen 2020-01-01,ABC before.
  • Row 3 is not returned because we have seen 2020-01-01,XYZ before (row 1).
  • Row 4 is not returned because we have seen 2020-01-01,ABC before (row 2).
  • Row 5 is returned because we haven't seen 2020-01-02,XYZ before.
  • Row 6 is not returned because we have seen 2020-01-01,ABC before (row 2).

Is there a way to do this directly in SQL? I've considered DISTINCT but I think that only returns the distinct columns, whereas I want the full row.

Strawberry
  • 33,750
  • 13
  • 40
  • 57
pwaring
  • 3,032
  • 8
  • 30
  • 46
  • @Strawberry Please re-open this question. The one you've linked as a duplicate is for PostgreSQL, not MySQL. – pwaring Jan 06 '20 at 15:18
  • Oops, well the point is, it's the single most frequently asked question under this tag too. – Strawberry Jan 06 '20 at 15:20
  • The answers in that question are for different database engines though, and far less succinct than the ones posted here where I've got three options to choose from. That question isn't even tagged as MySQL. – pwaring Jan 06 '20 at 15:25
  • I don't really understand your point. – Strawberry Jan 06 '20 at 15:26
  • My point is that you've marked a MySQL-specific question, with MySQL-specific answers, as a duplicate of a PostgreSQL question with PostgreSQL answers. – pwaring Jan 06 '20 at 15:29
  • Unfortunately, once closed, I can't change the reason for closing. – Strawberry Jan 06 '20 at 15:30

3 Answers3

1

You can use a correlated subquery:

select l.*
from logs l
where l.id = (select min(l2.id)
              from logs l2
              where l2.original_date = l.original_date and
                    l2.ref_no = l.ref_no
             );

For performance, you want an index on logs(original_date, ref_no, id).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks, that seems to have done the trick. I have accepted the other answer as the correlated subquery was incredibly slow (8 mins vs 0.05s), but this answer did work. – pwaring Jan 06 '20 at 16:19
  • @pwaring . . . Did you have the recommended index? – Gordon Linoff Jan 06 '20 at 16:22
  • No, there didn't seem to be much point adding an index to potentially speed up a query where an alternative that returns the same results was substantially faster without an index (this table is INSERT heavy so an index would come with an overhead). Also I may want to pick out other columns in future so I'd have to add another index each time. – pwaring Jan 06 '20 at 16:35
1

To avoid a correlated subquery you can do:

select l.*
from logs l
join (
  select original_date, ref_no, min(id) as min_id
  from logs
  group by original_date, ref_no
) x on l.id = x.min_id
The Impaler
  • 45,731
  • 9
  • 39
  • 76
  • Is there a particular reason to avoid a correlated subquery? Execution speed is not a major issue here as it's for manual debugging so I don't mind if the query takes a while to return. – pwaring Jan 06 '20 at 15:09
  • Nevermind, I benchmarked both queries and this one was several orders of magnitude faster (not sure why, `EXPLAIN` wasn't too helpful), so it's the accepted answer. – pwaring Jan 06 '20 at 16:19
  • 1
    Well, a correlated subquery is usually slower than a table expression, since it needs to be executed once for each row of the primary table. It could be just a few percentage points slower, or dramatically slower as in your case. – The Impaler Jan 06 '20 at 20:39
0

Try this:

select t1.*
from logs AS t1
left join logs AS t2 on 
(
  t2.original_date = t1.original_date and
  t2.ref_no = t1.ref_no and
  t2.id < t1.id
)
where
    t2.original_date is null and
    t2.ref_no is null
nunoq
  • 550
  • 2
  • 7