1

I have a table that looks like this

ID | identifier | data  | created_at
------------------------------------
1  |     500    | test1 | 2011-08-30 15:27:29
2  |     501    | test1 | 2011-08-30 15:27:29
3  |     500    | test2 | 2011-08-30 15:28:29
4  |     865    | test3 | 2011-08-30 15:29:29
5  |     501    | test2 | 2011-08-30 15:31:29
6  |     500    | test3 | 2011-08-30 15:31:29

What I need is the most up to date entry for each identifier, that could be decided by either the ID or the date in created_at. I assumed ID is the better choice due to the indexing.

I would expect this result set:

4  |     865    | test3 | 2011-08-30 15:29:29
5  |     501    | test2 | 2011-08-30 15:31:29
6  |     500    | test3 | 2011-08-30 15:31:29

The result should be ordered by either date or ID in ascending order.

It's important that this is a table that contains ~ 8 millions of rows.

I tried quite some approaches now with self joining and sub queries. Unfortunately all of those came out with either wrong results or half a decade of run time.

To provide an example:

SELECT lo1.* 
FROM table lo1
INNER JOIN
(
    SELECT MAX(id) MaxID, identifier, id
    FROM table 
    GROUP BY identifier
) lo2
  ON lo1.identifier= lo2.identifier
  AND lo1.id = lo2.MaxID
ORDER BY lo1.id DESC
LIMIT 10

The above query takes very long and does sometimes not return the latest result for an identifier, not quite sure why though.

Does anyone have an approach that is able to fetch the required result sets and preferably does not take a decade?

As asked, here is the create code:

CREATE TABLE `table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `identifier` int(11) NOT NULL,
  `data` varchar(200) COLLATE latin1_bin NOT NULL,
  `created_at` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `identifier` (`identifier`),
  KEY `created_at` (`created_at`),
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_bin
Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
TheNiceGuy
  • 3,462
  • 8
  • 34
  • 64
  • `SELECT MAX(id) MaxID, identifier, id FROM table GROUP BY identifier` is a invalid ansi group by SQL read this https://www.psce.com/en/blog/2012/05/15/mysql-mistakes-do-you-use-group-by-correctly/ – Raymond Nijland Mar 26 '18 at 21:05
  • Are the identifiers known ahead of time? If so it could be a nice idea to submit a query for each identifier. That could be done in parallel. – Stephen Crosby Mar 26 '18 at 21:06
  • Show us the output off `SHOW CREATE TABLE [table]` – Raymond Nijland Mar 26 '18 at 21:07
  • The query won't even run. Fix your GROUP BY first, then we can start debugging your query. – Eric Mar 26 '18 at 21:10
  • That's odd, for me the query runs fine... What is the exact error that you get? – TheNiceGuy Mar 26 '18 at 21:12
  • it's not odd @Michael there is a sql_mode called 'only_full_group_by' if you enable that. MySQL will enforcing you to write valid ansi group by queries. – Raymond Nijland Mar 26 '18 at 21:15
  • Ah yeah, that is true, we have disabled that.. – TheNiceGuy Mar 26 '18 at 21:16
  • To be honest, I'm sort of clueless on how to continue.. The above query is the best I got at the moment, the question is mainly about guidance on how to achive the desired result set with the best performance. It's not about debugging a specific query, that was just to provide one of the examples that I attempted. – TheNiceGuy Mar 26 '18 at 21:18
  • like i said before "Show us the output off `SHOW CREATE TABLE [table]`" then we see indexes – Raymond Nijland Mar 26 '18 at 21:20
  • I added it to the question. Thanks – TheNiceGuy Mar 26 '18 at 21:23
  • Have you tried the self left join solution outlined in the following question? https://stackoverflow.com/questions/7745609/sql-select-only-rows-with-max-value-on-a-column – Shadow Mar 26 '18 at 22:16

1 Answers1

2

The correct query that gives the correct results but it won't scale on larger tables.

Query

SELECT
  `table`.*
FROM
  `table`
INNER JOIN
(
    SELECT
        MAX(id) AS MaxID
      , identifier
    FROM
      `table` 
    GROUP BY
      identifier
    #disables GROUP BY Sorting might make the query faster.
    ORDER BY
      NULL  
) `table_group`
ON
 `table`.ID = `table_group`.MaxID
ORDER BY
  `table`.ID DESC
LIMIT 10

Result

| id | identifier |  data |           created_at |
|----|------------|-------|----------------------|
|  6 |        500 | test3 | 2011-08-30T15:31:29Z |
|  5 |        501 | test2 | 2011-08-30T15:31:29Z |
|  4 |        865 | test3 | 2011-08-30T15:29:29Z |

see demo http://www.sqlfiddle.com/#!9/7f4401/4

But when you check "View Execution Plan" you can see "Using where; Using temporary; Using filesort" in the extra column meaning MySQL needs to use a quicksort algorithm "Using temporary;" means the quicksort algorithm first will be run on a memory temporary table.
If the memory temporary table becomes to large it will be converted to a MyISAM on disk temporary table.
Meaning the quicksort will need disk based random i/o to sort which is slow on disks.
So this method will not scale on the table with ~8 millions of rows.

This query below also gives the same results but it should be more optimized

Query

SELECT 
 `table`.*
FROM
 `table` 
INNER JOIN ( 

  SELECT
    `table`.ID
  FROM
    `table`
  INNER JOIN
  (
      SELECT
          MAX(id) AS MaxID
        , identifier
      FROM
        `table` 
      GROUP BY
        identifier
      #disables GROUP BY Sorting might make the query faster.
      ORDER BY
        NULL  
  )
   AS `table_group`
  ON
   `table`.ID = `table_group`.MaxID
)
  AS `table_group_max`
ON
 `table`.ID = `table_group_max`.ID
ORDER BY
 `table`.ID DESC
LIMIT 10

Result

| id | identifier |  data |           created_at |
|----|------------|-------|----------------------|
|  6 |        500 | test3 | 2011-08-30T15:31:29Z |
|  5 |        501 | test2 | 2011-08-30T15:31:29Z |
|  4 |        865 | test3 | 2011-08-30T15:29:29Z |

see demo http://www.sqlfiddle.com/#!9/7f4401/21

When you check "View Execution Plan" there is no more "Using temporary; Using filesort" meaning the query should be more optimal then the previous query and should in theory execute faster.
Because the combination "Using temporary; Using filesort" can really be a performance killer like explained.

Raymond Nijland
  • 11,488
  • 2
  • 22
  • 34
  • 1
    Thanks Raymond, excellent answer. The second query performs very well, even on such a big dataset. Now after understanding it, it makes much sense to do it that way. Thanks a lot for your help, highly appreciated! :) – TheNiceGuy Mar 27 '18 at 17:31