Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns

Question

I am attempting to speed up a query that takes around 60 seconds to complete on a table of ~20 million rows.

For this example, the table has three columns (id, dateAdded, name). id is the primary key. The indexes I have added to the table are:

(dateAdded)
(name)
(id, name)
(id, name, dateAdded)

The query I am trying to run is:

SELECT MAX(id) as id, name 
FROM exampletable 
WHERE dateAdded <= '2014-01-20 12:00:00' 
GROUP BY name 
ORDER BY NULL;

The date is variable from query to query.

The objective of this is to get the most recent entry for each name at or before the date added.

When I use explain on the query it tells me that it is using the (id, name, dateAdded) index.

+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| id | select_type | table            | type  | possible_keys    | key                                          | key_len | ref  | rows     | Extra                                                     |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
|  1 | SIMPLE      | exampletable     | index | date_added_index | id_element_name_date_added_index             | 162     | NULL | 22016957 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+

Edit: Added two new indexes from comments:

(dateAdded, name, id)
(name, id)

+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| id | select_type | table            | type  | possible_keys                                                 | key                                          | key_len | ref  | rows     | Extra                                     |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | exampletable     | index | date_added_index,date_added_name_id_index                     | id__name_date_added_index                    | 162     | NULL | 22040469 | Using where; Using index; Using temporary |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+

Edit: Added create table script.

CREATE TABLE `exampletable` (
  `id` int(10) NOT NULL auto_increment,
  `dateAdded` timestamp NULL default CURRENT_TIMESTAMP,
  `name` varchar(50) character set utf8 default '',
  PRIMARY KEY  (`id`),
  KEY `date_added_index` (`dateAdded`),
  KEY `name_index` USING BTREE (`name`),
  KEY `id_name_index` USING BTREE (`id`,`name`),
  KEY `id_name_date_added_index` USING BTREE (`id`,`dateAdded`,`name`),
  KEY `date_added_name_id_index` USING BTREE (`dateAdded`,`name`,`id`),
  KEY `name_id_index` USING BTREE (`name`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=22046064 DEFAULT CHARSET=latin1

Edit: Here is the Explain from the answer provided by HeavyE.

+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| id | select_type | table        | type  | possible_k                                                                               | key                      | key_len | ref                                              | rows | Extra                                 |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
|  1 | PRIMARY     | <derived2>   | ALL   | NULL                                                                                     | NULL                     | NULL    | NULL                                             | 1732 | Using temporary; Using filesort       |
|  1 | PRIMARY     | example1     | ref   | date_added_index,name_index,date_added_name_id_index,name_id_index,name_date_added_index | date_added_name_id_index | 158     | maxDateByElement.dateAdded,maxDateByElement.name |    1 | Using where; Using index              |
|  2 | DERIVED     | exampletable | range | date_added_index,date_added_name_id_index                                                | name_date_added_index    | 158     | NULL                                             | 1743 | Using where; Using index for group-by |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+

I added the ORDER BY NULL because GROUPBY orders by default. By adding ORDER BY NULL, it removes the sort. See http://stackoverflow.com/questions/5231907/order-by-null-in-mysql. The problem also exists without the ORDER BY NULL. — FuryComputers, Jan 29 '14 at 18:24
If the "explain" looks good then maybe it's the hardware or the environment that's the issue? Curious, how many rows have dateAdded <= your date? — Rick S, Jan 29 '14 at 18:36
@RickS I'm still holding out hope that it can be solved in SQL, but it is possible that you are correct. To answer your question, most (90% - 95%) of the rows are typically <= the date. — FuryComputers, Jan 29 '14 at 18:41
Just as a side note, if I remove the WHERE clause, the query is still about the same speed. — FuryComputers, Jan 29 '14 at 18:54
I'm with @RickS on this: if the `EXPLAIN` looks right, it's hardware, settings, something outside the query. Can you post the `EXPLAIN` jic? — bishop, Jan 29 '14 at 20:29
Sounds like (dateAdded, name, id) would be better than (id, dateAdded, name) because that would keep it "index-only", but would also have everything in exactly the right order that's needed by this query. — Darius X., Jan 29 '14 at 20:32
@bishop I edited the post and added the explain. Hopefully that helps. — FuryComputers, Jan 29 '14 at 20:37
@DariusX. I wasn't aware that the order mattered, I will give that a shot and report back. — FuryComputers, Jan 29 '14 at 20:37
In addition to what Darius said before me, even though your query is using an index, it is still going through every row in the table. Changing the order of the columns in the index will probably help since it will actually be able to filter the rows using the index. Since most of the rows will match the condition of `dateAdded`, having it in the index might not be helpful so also try an index on `(name, id)`. — G-Nugget, Jan 29 '14 at 20:39
Note that `dateAdded` would be a bad choice for the first element in the index if it truly uses the hours, minutes, and seconds. If it can have as many different values as there are `ids`, then the index would be as large (row-wise) as the table itself. But if it really only uses the year/month/day then that's not bad. — dg99, Jan 29 '14 at 20:41
Some ideas: ensure `dateAdded` is just a `DATE`, put it first in your index, `ORDER BY id` and try rewriting the range out of the query, possibly by a self-join (so that the Cartesian product is less than a full table scan). — bishop, Jan 29 '14 at 21:11
Oh, and one other approach I forgot: [partition around dateAdded](http://dba.stackexchange.com/a/23007). I'm starting to feel like the query has an obvious optimization, but I'm snowed in right now and not near my desktop to check! — bishop, Jan 29 '14 at 21:52
@DariusX. I added the (dateAdded, name, id), unfortunately there were no obvious speed improvements. — FuryComputers, Jan 30 '14 at 14:14
@G-Nugget I added the (name, id), but again no obvious speed improvements. I will edit the post to show the new explain. — FuryComputers, Jan 30 '14 at 14:16
@bishop I appreciate all the good ideas. I will look into rewriting the query as you described. I have some idea what you mean, but I'm not sure how to rewrite it. If you have any resources for me to look at, they would be greatly appreciated. I also agree that this must have some obvious optimization that I am missing. — FuryComputers, Jan 30 '14 at 14:26
Also, just a note that the dateAdded does include seconds, they are important in this case. — FuryComputers, Jan 30 '14 at 14:27
Could you please update your post to include the `CREATE TABLE` so that we can replicate the entire problem domain? — bishop, Jan 30 '14 at 16:23
I tried several different things small changes to the SQL script but none of the changes could make my syntactically similar script run any faster with the changes than your current configuration. Good luck. — Christopher Brown, Jan 30 '14 at 18:12
@ChristopherBrown I appreciate the effort :). Its possible that it cannot be sped up. I remain hopeful. — FuryComputers, Jan 30 '14 at 18:18

score 7 · Accepted Answer · edited May 23 '17 at 12:28

7

There is a great Stack Overflow post on optimization of Selecting rows with the max value in a column: https://stackoverflow.com/a/7745635/633063

This seems a little messy but works very well:

SELECT example1.name, MAX(example1.id)
FROM exampletable example1
INNER JOIN (
select name, max(dateAdded) dateAdded
from exampletable
where dateAdded  <= '2014-01-20 12:00:00' 
group by name
) maxDateByElement on example1.name = maxDateByElement.name AND example1.dateAdded = maxDateByElement.dateAdded
GROUP BY name;

edited May 23 '17 at 12:28

Community

1
1

answered Feb 03 '14 at 19:40

HeavyE

2,132
1
23
31

From what I can tell, this is returning exactly the data I'm looking for, and it takes less than 1 second. – FuryComputers Feb 03 '14 at 19:43
@FuryComputers can u give it's explain result? – thekosmix Feb 04 '14 at 08:20
@thekosmix Added the explain to the question. Hope that helps you! – FuryComputers Feb 04 '14 at 18:26
But it is not necessarily correct! If the ids were not assigned in chronological order, this could deliver the wrong ids. (There are many processing situations that could cause them to be out of order.) On the other hand, This query may be _more correct_ than the OP's. It gets the "latest" (based on `dateAdded`), disambiguation based on `id`. – Rick James Nov 16 '21 at 17:21
In other words, "getting the "last id assigned" is not necessarily the same as "getting the last row added". – Rick James Nov 16 '21 at 17:23

score 2 · Answer 2 · answered Jan 31 '14 at 09:07

2

why are you using index on many keys?? if your where clause contains only one column, then use that index only, put index on dateAdded and on name separately and then use in sql statement like this:

SELECT MAX(id) as id, name 
FROM exampletable 
USE INDEX (dateAdded_index) USE INDEX FOR GROUP BY (name_index) 
WHERE dateAdded <= '2014-01-20 12:00:00' 
GROUP BY name
ORDER BY NULL;

here is the link if you want to know more. Please let me know, whether it is giving some positive results or not.

answered Jan 31 '14 at 09:07

thekosmix

1,705
21
35

That looks promising. I tried this and discovered that the MySQL instance is version 5.0.95. The index hint for GROUP BY doesn't work until 5.1. I will either test this on a different instance, or update the existing one, and let you know if it helps. – FuryComputers Jan 31 '14 at 19:48
From at least version 5.1.61 onward, [this query also performs a full table scan](http://www.sqlfiddle.com/#!2/3cb35/1). If you change it to `FORCE INDEX` then you do get range queries, but I suspect [you will have slower execution time even though the query plan is better](http://www.mysqlperformanceblog.com/2012/11/23/full-table-scan-vs-full-index-scan-performance/). – bishop Feb 03 '14 at 18:13
I tested on 5.6 with only marginal improvement in speed. Maybe a few seconds shaved off. – FuryComputers Feb 03 '14 at 19:42

score 0 · Answer 3 · answered Jan 30 '14 at 18:28

IF the where command makes no difference, then its either the max(id) or the name. I would test the indexes by eliminating Max(id) completely, and see if the group by name is fast. Then I would add Min(id) to see if its any faster than Max(id). (I have seen this make a difference).

Also, you should test the order by NULL. Try Order by name desc, or Order by name asc. Clark Vera

Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns

3 Answers3

Linked