0

I have the following table.

+--------------------+--------------+-------+
Date                 | SymbolNumber | Value
+--------------------+--------------+-------+
 2018-08-31 15:00:00 | 123          | data
 2018-09-31 15:00:00 | 456          | data
 2018-09-31 15:00:00 | 123          | data
 2018-09-31 15:00:00 | 555          | data
 2018-10-31 15:00:00 | 555          | data
 2018-10-31 15:00:00 | 231          | data
 2018-10-31 15:00:00 | 123          | data
 2018-11-31 15:00:00 | 123          | data
 2018-11-31 15:00:00 | 555          | data
 2018-12-31 15:00:00 | 123          | data
 2018-12-31 15:00:00 | 555          | data

I need a query that can select the last row of each SymbolNumber stated in the query.

SELECT
    *
FROM
    MyTable
WHERE
    symbolNumber IN (123, 555)
AND
    **lastOfRow ordered by latest-date**

Expected results:

 2018-12-31 15:00:00 | 123 | data
 2018-12-31 15:00:00 | 555 | data

How can I do this?

Shidersz
  • 16,846
  • 2
  • 23
  • 48
Jonathan002
  • 9,639
  • 8
  • 37
  • 58

2 Answers2

0

First, you will need a query that get the latest date for each symbolNumber. Second, you can inner join to this table (using date) for get the rest of the columns. Like this:

SELECT
    t.*
FROM
   <table_name> AS t
INNER JOIN
    (SELECT
        symbolNumber,
        MAX(date) AS maxDate
    FROM
        <table_name>
    GROUP BY
       symbolNumber) AS latest_date ON latest_date.symbolNumber = t.symbolNumber AND latest_date.maxDate = t.date

The previous query will get latest data for each existing symbolNumber on the table. If you want to restrict to symbolNumbers: 123 and 555, you will need to made next modification:

SELECT
    t.*
FROM
   <table_name> AS t
INNER JOIN
    (SELECT
        symbolNumber,
        MAX(date) AS maxDate
    FROM
        <table_name>
    WHERE
        symbolNumber IN (123, 555)
    GROUP BY
       symbolNumber) AS latest_date ON latest_date.symbolNumber = t.symbolNumber AND latest_date.maxDate = t.date
Shidersz
  • 16,846
  • 2
  • 23
  • 48
0
  • We can do a "self-left-join" on symbolNumber, and match to other rows in the same group with higher Date value on the right side.
  • We will eventually consider only those rows, where higher date could not be found (meaning the current row belongs to highest date in the group).

Here is a solution avoiding subquery, and utilizing Left Join:

SELECT t1.* 
FROM MyTable AS t1 
LEFT JOIN MyTable AS t2 
  ON t2.symbolNumber = t1.symbolNumber AND 
     t2.Date > t1.Date -- Joining to a row in same group with higher date
WHERE t1.symbolNumber IN (123, 555) AND 
      t2.symbolNumber IS NULL  -- Higher date not found; so this is highest row

EDIT:

Benchmarking studies comparing Left Join method v/s Derived Table (Subquery)

@Strawberry ran a little benchmark test in 5.6.21. Here's what he found...

DROP TABLE IF EXISTS my_table;

CREATE TABLE my_table
(id SERIAL PRIMARY KEY
,dense_user INT NOT NULL
,sparse_user INT NOT NULL
);

INSERT INTO my_table (dense_user,sparse_user) 
SELECT RAND()*100,RAND()*100000;

INSERT INTO my_table (dense_user,sparse_user)
SELECT RAND()*100,RAND()*100000 FROM my_table;
-- REPEAT THIS LINE A FEW TIMES !!!

SELECT COUNT(DISTINCT dense_user) dense
     , COUNT(DISTINCT sparse_user) sparse
     , COUNT(*) total 
  FROM my_table;
+-------+--------+---------+
| dense | sparse | total   |
+-------+--------+---------+
|   101 |  99999 | 1048576 |
+-------+--------+---------+

ALTER TABLE my_table ADD INDEX(dense_user);

ALTER TABLE my_table ADD INDEX(sparse_user);

--dense_test
SELECT x.* 
  FROM my_table x 
  LEFT 
  JOIN my_table y 
    ON y.dense_user = x.dense_user 
   AND y.id < x.id 
 WHERE y.id IS NULL 
 ORDER 
    BY dense_user 
 LIMIT 10;
+------+------------+-------------+
| id   | dense_user | sparse_user |
+------+------------+-------------+
| 1212 |          0 |        1950 |
|  153 |          1 |       23193 |
|  255 |          2 |       27472 |
|   28 |          3 |       86440 |
|   18 |          4 |       47886 |
|  291 |          5 |       76563 |
|   15 |          6 |       85049 |
|   16 |          7 |       78384 |
|  135 |          8 |       52304 |
|   62 |          9 |       40930 |
+------+------------+-------------+
10 rows in set (2.64 sec)

SELECT x.* 
  FROM my_table x 
  JOIN 
     ( SELECT dense_user, MIN(id) id FROM my_table GROUP BY dense_user ) y 
    ON y.dense_user = x.dense_user 
   AND y.id = x.id 
 ORDER 
    BY dense_user 
 LIMIT 10;
+------+------------+-------------+
| id   | dense_user | sparse_user |
+------+------------+-------------+
| 1212 |          0 |        1950 |
|  153 |          1 |       23193 |
|  255 |          2 |       27472 |
|   28 |          3 |       86440 |
|   18 |          4 |       47886 |
|  291 |          5 |       76563 |
|   15 |          6 |       85049 |
|   16 |          7 |       78384 |
|  135 |          8 |       52304 |
|   62 |          9 |       40930 |
+------+------------+-------------+
10 rows in set (0.05 sec)

Uncorrelated query is 50 times faster.

--sparse test
SELECT x.* 
  FROM my_table x 
  LEFT 
  JOIN my_table y 
    ON y.sparse_user = x.sparse_user 
   AND y.id < x.id 
 WHERE y.id IS NULL 
 ORDER 
    BY sparse_user 
 LIMIT 10;
+--------+------------+-------------+
| id     | dense_user | sparse_user |
+--------+------------+-------------+
| 165055 |         75 |           0 |
|  37598 |         63 |           1 |
| 170596 |         70 |           2 |
|  46142 |         87 |           3 |
|  33546 |         21 |           4 |
| 323114 |         87 |           5 |
|  86592 |         96 |           6 |
| 156711 |         36 |           7 |
|  17148 |         62 |           8 |
| 139965 |         71 |           9 |
+--------+------------+-------------+
10 rows in set (0.03 sec)

SELECT x.* 
  FROM my_table x 
  JOIN ( SELECT sparse_user, MIN(id) id FROM my_table GROUP BY sparse_user ) y 
    ON y.sparse_user = x.sparse_user 
   AND y.id = x.id 
 ORDER 
    BY sparse_user 
 LIMIT 10;
+--------+------------+-------------+
| id     | dense_user | sparse_user |
+--------+------------+-------------+
| 165055 |         75 |           0 |
|  37598 |         63 |           1 |
| 170596 |         70 |           2 |
|  46142 |         87 |           3 |
|  33546 |         21 |           4 |
| 323114 |         87 |           5 |
|  86592 |         96 |           6 |
| 156711 |         36 |           7 |
|  17148 |         62 |           8 |
| 139965 |         71 |           9 |
+--------+------------+-------------+
10 rows in set (4.73 sec)

Exclusion Join is 150 times faster

However, as you move further up the result set, the picture begins to change very dramatically...

SELECT x.* 
  FROM my_table x 
  JOIN ( SELECT sparse_user, MIN(id) id FROM my_table GROUP BY sparse_user ) y 
    ON y.sparse_user = x.sparse_user 
   AND y.id = x.id 
 ORDER 
    BY sparse_user 
 LIMIT 10000,10; 
+--------+------------+-------------+
| id     | dense_user | sparse_user |
+--------+------------+-------------+
|   9810 |         93 |       10000 |
| 162438 |          4 |       10001 |
| 467371 |         62 |       10002 |
|   8258 |         13 |       10003 |
| 297049 |         17 |       10004 |
|  68354 |         23 |       10005 |
| 192701 |         64 |       10006 |
| 176225 |         92 |       10007 |
| 156595 |         37 |       10008 |
| 318266 |          1 |       10009 |
+--------+------------+-------------+
10 rows in set (9.17 sec)

SELECT x.* 
  FROM my_table x 
  LEFT 
  JOIN my_table y 
    ON y.sparse_user = x.sparse_user 
   AND y.id < x.id 
 WHERE y.id IS NULL 
 ORDER 
    BY sparse_user 
 LIMIT 10000,10;
+--------+------------+-------------+
| id     | dense_user | sparse_user |
+--------+------------+-------------+
|   9810 |         93 |       10000 |
| 162438 |          4 |       10001 |
| 467371 |         62 |       10002 |
|   8258 |         13 |       10003 |
| 297049 |         17 |       10004 |
|  68354 |         23 |       10005 |
| 192701 |         64 |       10006 |
| 176225 |         92 |       10007 |
| 156595 |         37 |       10008 |
| 318266 |          1 |       10009 |
+--------+------------+-------------+
10 rows in set (32.19 sec) -- !!!

In summary, the exclusion join (the so-called 'strawberry query' can be (significantly) faster in certain, limited situations. More generally, an uncorrelated query will be faster.

Strawberry
  • 33,750
  • 13
  • 40
  • 57
Madhur Bhaiya
  • 28,155
  • 10
  • 49
  • 57
  • Fond as I am of this solution, I just don't think we would bother with it anymore, given that the alternatives are so much more efficient – Strawberry Nov 02 '18 at 07:38
  • @Strawberry just an alternative. I did read one of SO answers of the past, which "benchmarked" and found this variation faster. Although, with analytic (window) functions available in newer versions, all these SQL "tricks" can be done away with. – Madhur Bhaiya Nov 02 '18 at 07:41
  • 1
    I find that assertion highly improbable- but happy to review the benchmark. I've not yet seen a benchmark test that included window functions, but in my experience a conventional correlated (or uncorrelated) subquery solution will always massively outperform the above. It's a shame because in some circles, your solution is known as a 'Strawberry' query. – Strawberry Nov 02 '18 at 07:46
  • @Strawberry please see: https://stackoverflow.com/a/1313293/2469308 – Madhur Bhaiya Nov 02 '18 at 07:48
  • 1
    I'm very suspicious of BK's results (0.28 seconds vs 1 minute something). I'll test it myself later and report back - but I suspythst some serious caching is going on here – Strawberry Nov 02 '18 at 07:57
  • 1
    'suspythst' -> 'suspect' !! – Strawberry Nov 02 '18 at 10:10
  • @Strawberry thanks for the benchmark. Interesting observations. I will keep them in the answer (for future reference). – Madhur Bhaiya Nov 03 '18 at 19:12