Find two closest elements from one table to other element from another table

Question

I have two tables:

DROP TABLE IF EXISTS `left_table`;
CREATE TABLE `left_table` (
  `l_id` INT(11) NOT NULL AUTO_INCREMENT,
  `l_curr_time` INT(11) NOT NULL,
  PRIMARY KEY(l_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `right_table`;
CREATE TABLE `right_table` (
  `r_id` INT(11) NOT NULL AUTO_INCREMENT,
  `r_curr_time` INT(11) NOT NULL,
  PRIMARY KEY(r_id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO left_table(l_curr_time) VALUES
(3),(4),(6),(10),(13);

INSERT INTO right_table(r_curr_time) VALUES
(1),(5),(7),(8),(11),(12);

I want to map (if exists) two closest r_curr_time from right_table to each l_curr_time from left_table such that r_curr_time must be greater or equal to l_curr_time.

The expected result for given values should be:

+------+-------------+-------------+
| l_id | l_curr_time | r_curr_time |
+------+-------------+-------------+
|    1 |           3 |           5 |
|    1 |           3 |           7 |
|    2 |           4 |           5 |
|    2 |           4 |           7 |
|    3 |           6 |           7 |
|    3 |           6 |           8 |
|    4 |          10 |          11 |
|    4 |          10 |          12 |
+------+-------------+-------------+

I have following solution which works for one closest value. But I do not like it very much because it silently rely on fact that GROUP BY will remain the first occurrence from group:

SELECT l_id, l_curr_time, r_curr_time, time_diff FROM
(

    SELECT *, ABS(r_curr_time - l_curr_time) AS time_diff
    FROM left_table
    JOIN right_table ON 1=1
    WHERE r_curr_time >= l_curr_time
    ORDER BY l_id ASC, time_diff ASC
) t
GROUP BY l_id;

The output is following:

+------+-------------+-------------+-----------+
| l_id | l_curr_time | r_curr_time | time_diff |
+------+-------------+-------------+-----------+
|    1 |           3 |           5 |         2 |
|    2 |           4 |           5 |         1 |
|    3 |           6 |           7 |         1 |
|    4 |          10 |          11 |         1 |
+------+-------------+-------------+-----------+
4 rows in set (0.00 sec)

As you can see I am doing JOIN ON 1=1 is this OK also for large data (e.g. if both left_table and right_table has 10000 rows then Cartesian product will be 10^8 long)? Despite this lack I thing JOIN ON 1=1 is the only possible solution because first I need to create all possible combinations from existing tables and then pick up the ones which satisfies the condition, but if I'm wrong please correct me. Thanks.

score 1 · Accepted Answer · answered Oct 27 '17 at 00:26

This question is not trivial. In SQL Server or postgrsql it would be very easy because of the row_number() over x statement. This is not present in mysql. In mysql you have to deal with variables and chained select statements.

To solve this problem you have to combine multiple concepts. I will try to explain them one after the other to came to a solution that fits your question.

Lets start easy: How to build a table that contains the information of left_table and right_table?

Use a join. In this particular problem a left join and as the join condition we set that l_curr_time has to be smaller than r_curr_time. To make the rest easier we order this table by l_curr_time and r_curr_time. The statement is like the following:

SELECT l_id, l_curr_time, r_curr_time
FROM left_table l
LEFT JOIN right_table r ON l.l_curr_time<r.r_curr_time
ORDER BY l.l_curr_time, r.r_curr_time;

Now we have a table that is ordered and contains the information we want... but too many of them ;) Because the table is ordered it would be amazing if mysql could select only the two first occurent rows for each value in l_curr_time. This is not possible. We have to do it by ourselfs

mid part: How to number rows?

Use a variable! If you want to number a table you can use a mysql variable. There are two things to do: First of all we have to declare and define the variable. Second we have to increment this variable. Let's say we have a table with names and we want to know the position of all names when we order them by name:

SELECT name, @num:=@num+1 /* increment */
FROM table t, (SELECT @num:=0) as c
ORDER BY name ASC;

Hard part: How to number subset of rows depending of the value of one field?

Use variables to count (take a look above) and a variable for state pattern. We use the same principe like above but now we take a variable and save the value of the field we want depend on. If the value changes we reset the counter variable to zero. Again: This second variable have to be declared and defined. New Part: resetting a different variable depending on the content of the state variable:

SELECT
  l_id,
  l_curr_time,
  r_curr_time,
  @num := IF( /* (re)set num (the counter)... */
    @l_curr_time = l_curr_time,
    @num:= @num + 1, /* increment if the variable equals the actual l_curr_time field value */
    1 /* reset to 1 if the values are not equal */
  ) as row_num,
  @l_curr_time:=l_curr_time as lct /* state variable that holds the l_curr_time value */
FROM ( /* table from Step 1 of the explanation */
  SELECT l_id, l_curr_time, r_curr_time
  FROM left_table l
  LEFT JOIN right_table r ON l.l_curr_time<r.r_curr_time
  ORDER BY l.l_curr_time, r.r_curr_time
) as joinedTable

Now we have a table that holds all combinations we want (but too many) and all rows are numbered depending on the value of the l_curr_time field. In other words: Each subset is numbered from 1 to the amount of matching r_curr_time values that are greather or equal than l_curr_time.

Again the easy part: select all the values we want and depending on the row number

This part is easy. because the table we created in 3. is ordered and numbered we can filter by the number (it has to be smaller or equal to 2). Furthermore we select only the columns we're interessted in:

SELECT l_id, l_curr_time, r_curr_time, row_num
FROM ( /* table from step 3. */
  SELECT
    l_id,
    l_curr_time,
    r_curr_time,
    @num := IF(
      @l_curr_time = l_curr_time,
      @num:= @num + 1,
      1
    ) as row_num,
    @l_curr_time:=l_curr_time as lct
  FROM (
    SELECT l_id, l_curr_time, r_curr_time
    FROM left_table l
    LEFT JOIN right_table r ON l.l_curr_time<r.r_curr_time
    ORDER BY l.l_curr_time, r.r_curr_time
  ) as joinedTable
) as numberedJoinedTable,(
  SELECT @l_curr_time:='',@num:=0 /* define the state variable and the number variable */
) as counterTable
HAVING row_num<=2; /* the number has to be smaller or equal to 2 */

That's it. This statement returns exactly what you want. You can see this statement in action in this sqlfiddle.

I think you’re making a bit of a meal out of it. – Strawberry Oct 27 '17 at 06:14 — Strawberry, Oct 27 '17 at 06:14

score 1 · Answer 2 · answered Oct 27 '17 at 06:42

1

JoshuaK has the right idea. I just think it could be expressed a little more succinctly...

How about:

SELECT n.l_id
     , n.l_curr_time
     , n.r_curr_time
  FROM 
     ( SELECT a.*
            , CASE WHEN @prev = l_id THEN @i:=@i+1 ELSE @i:=1 END i
            , @prev := l_id prev
         FROM 
            ( SELECT l.*
                   , r.r_curr_time
                FROM left_table l
                JOIN right_table r
                  ON r.r_curr_time >= l.l_curr_time
            ) a
         JOIN 
            ( SELECT @prev := null,@i:=0) vars
        ORDER 
           BY l_id,r_curr_time
     ) n
 WHERE i<=2;

answered Oct 27 '17 at 06:42

Strawberry

33,750
13
40
57

Hi @Strawberry. Thanks for pointing to my answer, but a question: What did you do better? You changed the `if`-statement against a `when` statement. Used `*`-selector ([which is not recommend](https://stackoverflow.com/questions/3639861/why-is-select-considered-harmful)). The only good part (in my opinion) is: you moved the join inside and so you could use `where` instead of `having`. The execution plan is showing no difference in how it is handled inside in exception of you have one derived select more than me and I have one primary select more than you. Am I missing something? – Joshua K Oct 27 '17 at 12:45
@JoshuaK, I was going to suggest that my query might be more performant, but in my tests both were terrible - I wonder if there’s a better solution. – Strawberry Oct 30 '17 at 07:01

Find two closest elements from one table to other element from another table

2 Answers2