MySQL - Finding the earliest date by joining records from two tables

Question

I have two tables, store_customers and additional_customers.

Each store_id in the store_customers table must be unique
Each store_id, customer_id in the additional_customers table must be unique.

Each store_id in the additional_customers table must exist in the store_customers table

CREATE TABLE store_customers (
   autoinc     int(10) unsigned NOT NULL AUTO_INCREMENT,
   store_id    varchar(50) NOT NULL,
   customer_id varchar(50) NOT NULL,
   datetime    datetime    NOT NULL,
   field_1     varchar(50) DEFAULT NULL,
   field_2     varchar(50) DEFAULT NULL,
   PRIMARY KEY ('autoinc'),
   UNIQUE KEY 'store_id' ('store_id'),
   KEY 'customer_number' ('customer_number')
);
INSERT INTO store_customers (store_id, customer_id, datetime, field_1, field_2)
VALUES ('100', '10', '2011-01-01', 'aaa', 'bbb'),
       ('200', '20', '2012-01-01', 'ccc', 'ddd'),
       ('300', '20', '2013-01-01', 'eee', 'fff'),
       ('400', '40', '2014-01-01', 'ggg', 'hhh'),
       ('500', '50', '2015-01-01', 'iii', 'jjj'),
       ('600', '50', '2016-01-01', 'kkk', 'lll'),
       ('700', '70', '2017-01-01', 'mmm', 'nnn'),
       ('800', '70', '2018-01-01', 'ooo', 'ppp'),
       ('900', '90', '2019-01-01', 'qqq', 'rrr');


CREATE TABLE additional_customers (
   store_id    varchar(50) NOT NULL
   customer_id varchar(50) NOT NULL
);  
INSERT INTO store_customers (store_id, customer_id)
VALUES ('400', '41'),
       ('400', '42'),
       ('500', '51'),
       ('500', '52'),
       ('700', '71'),
       ('700', '72'),
       ('800', '81'),
       ('800', '82'),
       ('900', '70');

To make things easier to read:

   SELECT * FROM store_customers;

   store_id  customer_id  datetime     field_1  field_2
   100       10           2011-01-01   aaa      bbb
   200       20           2012-01-01   ccc      ddd
   300       20           2013-01-01   eee      fff
   400       40           2014-01-01   ggg      hhh
   500       50           2015-01-01   iii      jjj
   600       50           2016-01-01   kkk      lll
   700       70           2017-01-01   mmm      nnn
   800       70           2018-01-01   ooo      ppp
   900       90           2019-01-01   qqq      rrr

   -------------------------------------------------------

   SELECT * FROM additional_customers;

   store_id  customer_id
   400       41
   400       42
   500       51
   500       52
   700       71
   700       72
   800       81
   800       82
   900       70

For each distinct customer_id, I need to find the matching record in store_customers with the earliest datetime.

For example, customer_id = 70 appears in store_customers twice, tied to two different store_ids (700, 800) - and in additional_customers once, with a different store_id (900) than the records in the prior table. For that customer_id, the earliest datetime is store_id = 700, so I get one row for that customer_id with matching data from store_customers WHERE store_id = 700.

My end result would be:

store_id  customer_id  datetime     field_1  field_2
100       10           2011-01-01   aaa      bbb           
200       20           2012-01-01   ccc      ddd
400       40           2014-01-01   ggg      hhh
400       41           2014-01-01   ggg      hhh
400       42           2014-01-01   ggg      hhh
500       50           2015-01-01   iii      jjj
500       51           2015-01-01   iii      jjj
500       52           2015-01-01   iii      jjj
700       70           2017-01-01   mmm      nnn
700       71           2017-01-01   mmm      nnn
700       72           2017-01-01   mmm      nnn
800       81           2018-01-01   ooo      ppp
800       82           2018-01-01   ooo      ppp
900       90           2019-01-01   qqq      rrr

My current query "works", but I think it may be picking rows arbitrarily(?) - which I don't want. Also, EXPLAIN tells me this is terribly inefficient:

SELECT   sc.store_id,
         customers.customer_id,
         MIN(sc.datetime),
         sc.field_1,
         sc.field_2
FROM     (
           SELECT store_id, customer_id
           FROM   store_customers
           UNION
           SELECT store_id, customer_id
           FROM   additional_customers
         ) AS customers
JOIN     store_customers sc
         ON sc.store_id    = customers.store_id
         OR sc.customer_id = customers.customer_id
GROUP BY customer_id;

How can I make a more accurate, efficient query to return my expected result?

Edit: I should add - I can add additional indexes if necessary. Also, I'm using MySQL version 5.7 - so I can't use window functions.

It will always be inefficient with larger data if you do not use indexes. Can you confirm if you use any? And if so you should include those in the create table query. If not I would recommend reading this one 1st: https://stackoverflow.com/questions/3049283/mysql-indexes-what-are-the-best-practices — George G, Jul 19 '22 at 20:19
@GeorgeG, there are indexes on the `store_customers` table. I edited my original post to include them. Also, I can add additional indexes if necessary. — ComputersAreNeat, Jul 19 '22 at 20:29
A simple join, combined with a window function (probably `FIRST_VALUE()` should get you there - https://dev.mysql.com/doc/refman/8.0/en/window-functions.html — Jerry, Jul 20 '22 at 07:55
Thanks @Jerry, but I'm on MySQL 5.7. I should have (and will) clarify that in my post. — ComputersAreNeat, Jul 20 '22 at 10:40

O. Jones · Accepted Answer · 2022-07-20T11:18:45.333

This class of problem is called greatest-n-per-group. You're looking for the least, not the greatest, but it's the same principle.

Let's break it down.

First, avoid misusing MySQL's notorious nonstandard extension to GROUP BY. That will get rid of the risk of returning arbitrary data, and force you to use GROUP BY correctly.

SET SESSION sql_mode = CONCAT(@@session.sql_mode, ',', 'ONLY_FULL_GROUP_BY');

Next, you need a subquery getting the list of customers. That is this. UNION deduplicates the list for you. (SQL is all about sets of data.)

SELECT customer_id FROM store_customers
 UNION
SELECT customer_id FROM additional_customers;

Next, you need to figure the earliest datetime in store_customers for each of those customer_id values.

SELECT customers.customer_id, MIN(store_customers.datetime) first_datetime
  FROM store_customers 
  JOIN (
          SELECT customer_id FROM store_customers
           UNION
          SELECT customer_id FROM additional_customers
       ) customers ON store_customers.customer_id = customers.customer_id
 GROUP BY customers.customer_id

Next you need to use that as a subquery to extract the store_customers row matching each earliest datetime.

SELECT store_customers.*
  FROM store_customers
  JOIN (
           SELECT customers.customer_id, MIN(store_customers.datetime) first_datetime
             FROM store_customers 
             JOIN (
                    SELECT customer_id FROM store_customers
                    UNION
                    SELECT customer_id FROM additional_customers
                  ) customers ON store_customers.customer_id = customers.customer_id
           GROUP BY customers.customer_id
       ) firsts ON store_customers.customer_id = firsts.customer_id
               AND store_customers.datetime = firsts.first_datetime

Finally, replace your index on customer_id with this compound index to speed up the search for the minimum date. (Notice that MySQL EXPLAIN output might be different for tiny datasets like your example.

ALTER TABLE store_customers
  DROP KEY customer_id,
  ADD KEY cust_date (customer_id, datetime DESC);

Here's a fiddle.

Nested queries put the Structured in Structured Query Language, eh?

MySQL - Finding the earliest date by joining records from two tables

1 Answers1

Linked