
I'm trying to troubleshoot a performance issue on MySQL, so I wanted to create a smaller version of a table to work with. When I add a LIMIT clause to the query, it goes from about 2 seconds (for the full insert) to an astronomical 42 minutes.

mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15'
group by pr.player_id;
+------------+-------------+
| player_id  | insert_date |
+------------+-------------+
| 1002395119 | 2012-05-14  |
...
| 1002395157 | 2012-05-14  |
| 1002395187 | 2012-05-14  |
| 1002395475 | 2012-05-14  |
+------------+-------------+
105776 rows in set (2.19 sec)

mysql> select pr.player_id, max(pr.insert_date) as insert_date from player_record pr
inner join date_curr dc on pr.player_id = dc.player_id where pr.insert_date < '2012-05-15' 
group by pr.player_id limit 1;
+------------+-------------+
| player_id  | insert_date |
+------------+-------------+
| 1000000080 | 2012-05-14  |
+------------+-------------+
1 row in set (42 min 23.26 sec)

mysql> describe player_record;
+------------------------+------------------------+------+-----+---------+-------+
| Field                  | Type                   | Null | Key | Default | Extra |
+------------------------+------------------------+------+-----+---------+-------+
| player_id              | int(10) unsigned       | NO   | PRI | NULL    |       |
| insert_date            | date                   | NO   | PRI | NULL    |       |
| xp                     | int(10) unsigned       | YES  |     | NULL    |       |
+------------------------+------------------------+------+-----+---------+-------+
17 rows in set (0.01 sec) (most columns removed)

There are 20 million rows in the player_record table, so I am creating two tables in memory for the specific dates I am looking to compare.

CREATE TEMPORARY TABLE date_curr
(
      player_id INT UNSIGNED NOT NULL,
      insert_date DATE,
      PRIMARY KEY (player_id, insert_date)
) ENGINE=MEMORY;
INSERT INTO date_curr
SELECT  player_id,
        MAX(insert_date) AS insert_date
FROM player_record
WHERE insert_date BETWEEN '2012-05-15' AND '2012-05-15' + INTERVAL 6 DAY
GROUP BY player_id;

CREATE TEMPORARY TABLE date_prev LIKE date_curr;
INSERT INTO date_prev
SELECT pr.player_id,
       MAX(pr.insert_date) AS insert_date
FROM   player_record pr
INNER  JOIN date_curr dc
       ON pr.player_id = dc.player_id
WHERE  pr.insert_date < '2012-05-15'
GROUP  BY pr.player_id LIMIT 0,20000;

date_curr has 216k entries, and date_prev has 105k entries if I don't use a limit.

These tables are just part of the process, used to trim down another table (500 million rows) to something manageable. date_curr includes the player_id and insert_date from the current week, and date_prev has the player_id and most recent insert_date from BEFORE the current week for any player_id present in date_curr.

Here is the explain output:

mysql> explain SELECT pr.player_id, 
                      MAX(pr.insert_date) AS insert_date 
               FROM   player_record pr 
               INNER  JOIN date_curr dc 
                      ON pr.player_id = dc.player_id
               WHERE  pr.insert_date < '2012-05-15' 
               GROUP  BY pr.player_id 
               LIMIT  0,20000;                    
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type  | possible_keys       | key         | key_len | ref  | rows   | Extra                                        |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
|  1 | SIMPLE      | pr    | range | PRIMARY,insert_date | insert_date | 3       | NULL | 396828 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | dc    | ALL   | PRIMARY             | NULL        | NULL    | NULL | 216825 | Using where; Using join buffer               |
+----+-------------+-------+-------+---------------------+-------------+---------+------+--------+----------------------------------------------+
2 rows in set (0.03 sec)

This is on a system with 24G of RAM dedicated to the database, which is currently pretty much idle. This specific database is the test copy, so it is completely static. I restarted mysql and it still shows the same behavior.

Here is the 'show profile all' output; most of the time is spent copying to a tmp table.

| Status               | Duration   | CPU_user   | CPU_system | Context_voluntary | Context_involuntary | Block_ops_in | Block_ops_out | Messages_sent | Messages_received | Page_faults_major | Page_faults_minor | Swaps | Source_function       | Source_file   | Source_line |
| Copying to tmp table | 999.999999 | 999.999999 |   0.383941 |            110240 |               18983 |        16160 |           448 |             0 |                 0 |                 0 |                43 |     0 | exec                  | sql_select.cc |        1976 |
Josh
  • I suspect your problem is the filesort however I am more worried about what the queries are doing. For example you add rows into date_curr then add rows to date_prev. Then in your query you select from date_curr and never use date_prev in the query? Please clarify the example a little. Also which part of the procedure is taking the longest? Is it the select? Is it the insert? Please clarify. – Namphibian Dec 13 '12 at 05:51
  • How about a `show create table player_record` so we can see all the other stuff like indices etc. I bet if you add an index on the date columns you are querying against would speed it up. – Mike Purcell Dec 13 '12 at 06:04
  • I have an index on insert_date and it is the key used as shown in the explain statement. The query without limit returns in 2 seconds (which requires assembling the complete response); I can't figure out how limiting the output would be slower than providing the entire output. date_curr is taking the player_record entry from the current week, date_prev is taking the most recent player_record before the current week. The date_prev query (with the limit) is the slow one, and it's slow just as the standalone query without doing an insert. – Josh Dec 13 '12 at 06:25
  • It appears you are seriously prematurely optimizing. If you only have 20 million records you should be able to do most reasonable work in the tables if properly indexed. (This is the type of quandary you also see when people start using temp tables.) Your table is simple enough that even 500MM rows should be manageable. With SQL "divide and conquer" may make it easier to read and write, but it invariably takes longer. You're never as good at it as an SQL query optimizer as long as you treat it respectfully. – dkretz Dec 13 '12 at 07:45
  • @Josh your table has a primary key on playerid and insert_date right? Then you indexed the insert_date again? That might be the problem if I am correct. Do me favour drop the index on insert_date and run the query again MySQL might be using the wrong index here. This could be due to the fact that this is a static database and MySQL might have wrong statistics. – Namphibian Dec 13 '12 at 09:31
  • le dorfier: That may or may not be; I did a lot of playing around and was not successful at getting it to optimize correctly. I ran into this while looking at alternatives and want to figure it out before proceeding, in case it represents a deeper problem. I'll try to force use of the primary key and also deleting the other index, but again **it works just fine if I don't limit the output**. – Josh Dec 13 '12 at 17:16
  • @Namphibian post that as the answer. – Josh Dec 13 '12 at 18:29
  • Like I said, should have posted the `show create table`. Nice guess without all the info Namphibian. – Mike Purcell Dec 14 '12 at 01:47
  • @MikePurcell There was a lot info pointing to a wrong index being used. The possible_keys showed me two indexes and the index used was one of the two values. The filesort based on the GROUP BY player_id was also a major clue. So mostly a guess based on solid evidence. – Namphibian Dec 14 '12 at 07:02
  • @Namphibian: I never would have guessed he setup a composite PK with player_id and insert_date, as I have never seen a schema setup like this, which is why I was harping on the show create table. – Mike Purcell Dec 14 '12 at 07:11
  • @MikePurcell this is why I love databases so much you learn something every day. Without that I would get bored and drink waaaaaay to much! – Namphibian Dec 14 '12 at 07:21
  • Guessing it's Mike's downvote. Am I misreading the 'describe player_record' output or does it not show the composite primary key? Anyway, the most confusing part was that it changed the index (and join order) based on 'LIMIT 1'. – Josh Dec 14 '12 at 15:43

2 Answers


A bit of a long answer but I hope you can learn something from this.

So based on the evidence in the explain output you can see that there were two possible indexes the MySQL query optimizer could have used. They are as follows:

possible_keys
PRIMARY,insert_date

However, the MySQL query optimizer decided to use the following index:

key
insert_date

This is a rare occasion where the MySQL query optimizer used the wrong index, and there is a probable cause. You are working on a static development database, which you probably restored from production to develop against.

When the MySQL optimizer needs to decide which index to use in a query, it looks at the statistics kept for all the possible indexes. You can read more about statistics at http://dev.mysql.com/doc/innodb-plugin/1.0/en/innodb-other-changes-statistics-estimation.html for a start.

When you update, insert into, or delete from a table, you change the index statistics. Because this data is static, the MySQL server may have had stale statistics and chosen the wrong index. This is, however, just a guess at a possible root cause at this point.
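
A hedged way to test the stale-statistics theory (assuming the table is InnoDB, whose statistics ANALYZE TABLE re-samples) is to rebuild the statistics and then re-check the plan for the limited query:

```sql
-- Refresh the index statistics for the table
-- (cheap, and harmless on a static test copy).
ANALYZE TABLE player_record;

-- Re-check which index the optimizer now picks for the limited query.
EXPLAIN
SELECT pr.player_id, MAX(pr.insert_date) AS insert_date
FROM player_record pr
INNER JOIN date_curr dc ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 1;
```

If the `key` column of the EXPLAIN output switches to PRIMARY after the rebuild, stale statistics were likely the culprit.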

Now let's dive into the indexes. There were two possible indexes to use: the primary key index and the index on insert_date, and MySQL used the insert_date one. Remember that during query execution MySQL will normally use only one index per table. Let's look at the difference between the primary key index and the insert_date index.

A simple fact about a primary key index (aka clustered):

  1. A primary key index is normally a btree structure that contains the data rows themselves, i.e. it is the table, since it contains the data.

A simple fact about a secondary index (aka non-clustered):

  1. A secondary index is normally a btree structure that contains the indexed data (the columns in the index) and a pointer to the location of the data row in the primary key index.

This is a subtle but big difference.

Let me explain: when you read a primary key index, you are reading the table, and the table is stored in the order of the primary index. Thus to find a value you search the index and read the data, which is one operation.

When you read a secondary index, you search the index to find the pointer, then read the primary key index to find the data the pointer refers to. This is essentially two operations, making a secondary index read roughly twice as costly as a primary key index read.

In your case, since the optimizer chose insert_date as the index to use, it was doing double the work just to do the join. That is problem one.

Now, when you LIMIT a recordset, that is the last piece of the query's execution. MySQL has to take the entire recordset, sort it (if not already sorted) based on the ORDER BY and GROUP BY conditions, then take the number of records you want and send them back based on the LIMIT clause. MySQL has to do a lot of work to keep track of which records to send, where it is in the recordset, and so on. LIMIT does carry a performance cost, but I suspect there is a contributing factor; read on.

Look at your GROUP BY: it is on player_id, while the index used is on insert_date. GROUP BY essentially orders your recordset, but it had no index to use for that ordering (remember, an index is sorted in the order of the column(s) it contains). Essentially you were asking for a sort/order on player_id while the index used was sorted on insert_date.

This step caused the filesort problem: the data returned from reading the secondary index and then the primary key index (remember, two operations) has to be sorted. Sorting a large result is often pushed to disk, as it is very expensive to hold and sort it all in memory. Thus the entire query result was written to disk and sorted, painfully slowly, to produce your results.
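
Whether that temporary table actually spills to disk depends on the server's memory limits. A quick way to check (these are standard MySQL system and status variables):

```sql
-- In-memory temporary tables are capped by the smaller of these two.
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- If this counter climbs after running the query, the temporary
-- table was written to disk.
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
```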

By removing the insert_date index, MySQL will use the primary key index, which means the data is already ordered (for ORDER BY/GROUP BY purposes) by player_id and insert_date. This eliminates the need to read the secondary index and then follow the pointer to the primary key index, i.e. the table, and since the data is already sorted, MySQL has very little work to do when applying the GROUP BY piece of the query.
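
As a sketch of that fix (the index name insert_date is taken from the EXPLAIN output above; double-check the actual name before dropping anything):

```sql
-- Confirm the real index name first.
SHOW CREATE TABLE player_record;

-- Then drop the redundant secondary index so the optimizer falls
-- back to the primary key on (player_id, insert_date).
ALTER TABLE player_record DROP INDEX insert_date;
```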

Now the following is again a bit of an educated guess; if you could post the results of the explain statement after the index was dropped, I could probably confirm my thinking. By using the wrong index, the results were sorted on disk so the LIMIT could be applied properly. Removing the LIMIT probably allows MySQL to sort in memory, as it does not have to apply the LIMIT and keep track of what is being returned. The LIMIT probably caused the temporary table to be created. Once again, it is difficult to say without seeing the difference between the statements, i.e. the explain output.

Hopefully this gives you a better understanding of indexes and why they are a double-edged sword.

Namphibian
  • Using LIMIT switched the index and the order of the join - that was my big confusion and shouldn't happen IMO. The query joined date_curr then player_record using PRIMARY, but add 'LIMIT 1' and it joined player_record then date_curr using insert_date. I submitted a bug report. Using FORCE INDEX worked fine; no need to delete. I also found that after 5/30/12 it worked. My stats gathering started at 5/14 and I think the small number of insert_date fooled it, but again no reason that the limit should change the query setup. Thanks for making me take a second look at the indexes. – Josh Dec 14 '12 at 15:31
  • Great answer, but force index is the way to go :) – Konstantin Krass Jan 15 '20 at 13:52
  • @KonstantinKrass until the forced index is no longer efficient at answering the query, and then you have to force another index. At some point the index will become a problem, thus it is better to adjust indexes than to force their use. – Namphibian Jan 31 '20 at 03:32
  • @Namphibian You are right. Always forcing the index is not performing great. It will take some refactoring to make it work then. – Konstantin Krass Feb 02 '20 at 17:32

Had the same problem. When I added FORCE INDEX (id), the query went back to the few milliseconds it took without the limit, while producing the same results.
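
Applied to the query in the question, that fix would look something like this, forcing the primary key rather than a named index, since the question's table keys on (player_id, insert_date):

```sql
SELECT pr.player_id, MAX(pr.insert_date) AS insert_date
FROM player_record pr FORCE INDEX (PRIMARY)
INNER JOIN date_curr dc ON pr.player_id = dc.player_id
WHERE pr.insert_date < '2012-05-15'
GROUP BY pr.player_id
LIMIT 1;
```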

gilm