- We can do a "self-left-join" on
symbolNumber
, and match to other rows in the same group with higher Date
value on the right side.
- We will eventually consider only those rows, where higher date could not be found (meaning the current row belongs to highest date in the group).
Here is a solution avoiding subquery, and utilizing Left Join
:
SELECT t1.*
FROM MyTable AS t1
LEFT JOIN MyTable AS t2
ON t2.symbolNumber = t1.symbolNumber AND
t2.Date > t1.Date -- Joining to a row in same group with higher date
WHERE t1.symbolNumber IN (123, 555) AND
t2.symbolNumber IS NULL -- Higher date not found; so this is highest row
EDIT:
Benchmarking studies comparing Left Join
method v/s Derived Table (Subquery)
@Strawberry ran a little benchmark test in 5.6.21. Here's what he found...
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(id SERIAL PRIMARY KEY
,dense_user INT NOT NULL
,sparse_user INT NOT NULL
);
INSERT INTO my_table (dense_user,sparse_user)
SELECT RAND()*100,RAND()*100000;
INSERT INTO my_table (dense_user,sparse_user)
SELECT RAND()*100,RAND()*100000 FROM my_table;
-- REPEAT THIS LINE A FEW TIMES !!!
SELECT COUNT(DISTINCT dense_user) dense
, COUNT(DISTINCT sparse_user) sparse
, COUNT(*) total
FROM my_table;
+-------+--------+---------+
| dense | sparse | total |
+-------+--------+---------+
| 101 | 99999 | 1048576 |
+-------+--------+---------+
ALTER TABLE my_table ADD INDEX(dense_user);
ALTER TABLE my_table ADD INDEX(sparse_user);
--dense_test
SELECT x.*
FROM my_table x
LEFT
JOIN my_table y
ON y.dense_user = x.dense_user
AND y.id < x.id
WHERE y.id IS NULL
ORDER
BY dense_user
LIMIT 10;
+------+------------+-------------+
| id | dense_user | sparse_user |
+------+------------+-------------+
| 1212 | 0 | 1950 |
| 153 | 1 | 23193 |
| 255 | 2 | 27472 |
| 28 | 3 | 86440 |
| 18 | 4 | 47886 |
| 291 | 5 | 76563 |
| 15 | 6 | 85049 |
| 16 | 7 | 78384 |
| 135 | 8 | 52304 |
| 62 | 9 | 40930 |
+------+------------+-------------+
10 rows in set (2.64 sec)
SELECT x.*
FROM my_table x
JOIN
( SELECT dense_user, MIN(id) id FROM my_table GROUP BY dense_user ) y
ON y.dense_user = x.dense_user
AND y.id = x.id
ORDER
BY dense_user
LIMIT 10;
+------+------------+-------------+
| id | dense_user | sparse_user |
+------+------------+-------------+
| 1212 | 0 | 1950 |
| 153 | 1 | 23193 |
| 255 | 2 | 27472 |
| 28 | 3 | 86440 |
| 18 | 4 | 47886 |
| 291 | 5 | 76563 |
| 15 | 6 | 85049 |
| 16 | 7 | 78384 |
| 135 | 8 | 52304 |
| 62 | 9 | 40930 |
+------+------------+-------------+
10 rows in set (0.05 sec)
Uncorrelated query is 50 times faster.
--sparse test
SELECT x.*
FROM my_table x
LEFT
JOIN my_table y
ON y.sparse_user = x.sparse_user
AND y.id < x.id
WHERE y.id IS NULL
ORDER
BY sparse_user
LIMIT 10;
+--------+------------+-------------+
| id | dense_user | sparse_user |
+--------+------------+-------------+
| 165055 | 75 | 0 |
| 37598 | 63 | 1 |
| 170596 | 70 | 2 |
| 46142 | 87 | 3 |
| 33546 | 21 | 4 |
| 323114 | 87 | 5 |
| 86592 | 96 | 6 |
| 156711 | 36 | 7 |
| 17148 | 62 | 8 |
| 139965 | 71 | 9 |
+--------+------------+-------------+
10 rows in set (0.03 sec)
SELECT x.*
FROM my_table x
JOIN ( SELECT sparse_user, MIN(id) id FROM my_table GROUP BY sparse_user ) y
ON y.sparse_user = x.sparse_user
AND y.id = x.id
ORDER
BY sparse_user
LIMIT 10;
+--------+------------+-------------+
| id | dense_user | sparse_user |
+--------+------------+-------------+
| 165055 | 75 | 0 |
| 37598 | 63 | 1 |
| 170596 | 70 | 2 |
| 46142 | 87 | 3 |
| 33546 | 21 | 4 |
| 323114 | 87 | 5 |
| 86592 | 96 | 6 |
| 156711 | 36 | 7 |
| 17148 | 62 | 8 |
| 139965 | 71 | 9 |
+--------+------------+-------------+
10 rows in set (4.73 sec)
Exclusion Join is 150 times faster
However, as you move further up the result set, the picture begins to change very dramatically...
SELECT x.*
FROM my_table x
JOIN ( SELECT sparse_user, MIN(id) id FROM my_table GROUP BY sparse_user ) y
ON y.sparse_user = x.sparse_user
AND y.id = x.id
ORDER
BY sparse_user
LIMIT 10000,10;
+--------+------------+-------------+
| id | dense_user | sparse_user |
+--------+------------+-------------+
| 9810 | 93 | 10000 |
| 162438 | 4 | 10001 |
| 467371 | 62 | 10002 |
| 8258 | 13 | 10003 |
| 297049 | 17 | 10004 |
| 68354 | 23 | 10005 |
| 192701 | 64 | 10006 |
| 176225 | 92 | 10007 |
| 156595 | 37 | 10008 |
| 318266 | 1 | 10009 |
+--------+------------+-------------+
10 rows in set (9.17 sec)
SELECT x.*
FROM my_table x
LEFT
JOIN my_table y
ON y.sparse_user = x.sparse_user
AND y.id < x.id
WHERE y.id IS NULL
ORDER
BY sparse_user
LIMIT 10000,10;
+--------+------------+-------------+
| id | dense_user | sparse_user |
+--------+------------+-------------+
| 9810 | 93 | 10000 |
| 162438 | 4 | 10001 |
| 467371 | 62 | 10002 |
| 8258 | 13 | 10003 |
| 297049 | 17 | 10004 |
| 68354 | 23 | 10005 |
| 192701 | 64 | 10006 |
| 176225 | 92 | 10007 |
| 156595 | 37 | 10008 |
| 318266 | 1 | 10009 |
+--------+------------+-------------+
10 rows in set (32.19 sec) -- !!!
In summary, the exclusion join (the so-called 'strawberry query' can be (significantly) faster in certain, limited situations. More generally, an uncorrelated query will be faster.