
Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each taking 11 seconds, and I don't really want the customer to have to wait a minute for all of these queries to run and display the data.

I am using 4 different AJAX requests to call APIs to get the data I need, and these all start at once, but the queries run one after another. If there were a way to get these queries to all run at once (in parallel), so the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.

Here is the query I am running:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

I can't think of any way to speed this query up. Below are pictures of the table indexes and the EXPLAIN output for this query.

[image: table indexes]

[image: EXPLAIN output]

I think the above query is using the relevant indexes for the WHERE conditions.

If there is anything you can think of to speed this query up, please let me know; I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds maximum. If I am wrong about the AJAX issue, please let me know, as this would also fix my issue.

" EDIT "

I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th - 9th), which returns 130,000 rows, the query time is 0.7 seconds, but when I add one more day onto that range (5th - 10th), so that it returns over 150,000 rows, the query time is 13 seconds. I have run loads of different ranges and have come to the conclusion that returning more than 150,000 rows has a huge effect on the query times.

Table definition:

CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `day_epoch` int(10) NOT NULL,
 `day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
 `hour` int(2) NOT NULL,
 `venue_id` int(5) NOT NULL,
 `zone_id` int(5) NOT NULL,
 `device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
 `device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
 `first_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `last_seen` int(10) unsigned NOT NULL DEFAULT '0',
 `is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
 `prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
 PRIMARY KEY (`id`,`venue_id`) USING BTREE,
 KEY `venue_id` (`venue_id`),
 KEY `zone_id` (`zone_id`),
 KEY `day_of_week` (`day_of_week`),
 KEY `day_epoch` (`day_epoch`),
 KEY `hour` (`hour`),
 KEY `device_uuid` (`device_uuid`),
 KEY `is_repeat` (`is_repeat`),
 KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
– Lukerayner
  • I believe MySQL generally can only take advantage of one index at a time, so having the fields indexed separately may not be the best option; I'd suggest trying maybe creating a composite index on `(venue_id, day_epoch)` or `(venue_id, zone_id, day_epoch)`. ...also, including your table's CREATE in the question never hurts. – Uueerdo Mar 22 '19 at 16:32
  • `GROUP BY` is typically for aggregation; are you wanting a semi-random value selected for the ungrouped fields? – Uueerdo Mar 22 '19 at 16:35
  • @Uueerdo the above query isn't the entire query; I have an outer select which needs that `GROUP BY`, but I didn't bother showing it because it isn't the issue, the inner query (shown) is slowing it down. I will have a go at creating that composite index. – Lukerayner Mar 22 '19 at 16:50
  • If the GROUP BY is for an outer query, it should not need to be included in what you're posting; and the original query should have had a `)` separating what you've posted from the outer portion before the GROUP BY. – Uueerdo Mar 22 '19 at 16:58
  • @Uueerdo The reason the `GROUP BY` is included is that it might be slowing the query down. I understand you didn't know why it was there, which I have now explained. I think the issue is due to the edit that I have made. – Lukerayner Mar 22 '19 at 17:22
  • Values chosen for unaggregated fields that are not included in the GROUP BY are effectively random choices from the values encountered for the group they accompany in the final results... technically, MySQL does not even guarantee the two references you have to `hour` will come from the same source row. – Uueerdo Mar 22 '19 at 17:26
  • Ok, sorry, that makes sense. I will remove the two `hour` references from the select because I shouldn't have them there. Do you know why the 150,000+ rows might be slowing it down so much? – Lukerayner Mar 22 '19 at 17:34
  • I am not saying the hour references would be different; in fact, they'll almost certainly be the same. I was just saying MySQL doesn't actually guarantee they would be. You could still end up with issues with is_repeat, if zone 102 has is_repeat 1 and zone 105 has is_repeat 0, you could get either...and you might not even get the same value if you run the query twice. I can't be completely sure about performance on a query considered invalid by default configurations on recent versions of MySQL. What is the performance for just that one additional day by itself? – Uueerdo Mar 22 '19 at 17:45
  • Ok, that makes sense, I will look into it. The performance for that one day by itself is 0.2 seconds, and if I change the range to 6th - 10th the query speed is 0.9 seconds, so it isn't that day which is causing the issue. The reason I mentioned 150,000+ rows is because I looked into the issue on a completely different range which only looked between 11th - 13th, and that returned 130,000 rows, but when I looked at 11th - 14th, which had 150,000+ rows, the query time was 13 seconds again. – Lukerayner Mar 22 '19 at 17:51
  • One other consideration is that OR conditions can often destroy performance; and while recent versions of MySQL have made some optimizations to `IN`, historically it was basically handled as a list of `OR a = b` conditions.... but if you need the list, you need the list. – Uueerdo Mar 22 '19 at 17:56
  • if that is part of some other query, please remove `GROUP BY` from this query. Right now, looking at it, it makes no sense to `GROUP BY` and have no aggregation functions involved – Alex Mar 22 '19 at 17:56
  • and can you confirm and prove that this query with no `GROUP BY` takes more than half a second? How do you measure the speed? – Alex Mar 22 '19 at 18:06
  • @Alex Yeah, I just double-checked and without the `GROUP BY` the query took 13 seconds to run. I am using phpMyAdmin to measure the speed; I know it might not be 100% accurate, but because the difference in speeds is so large I doubt the accuracy matters too much. I have run the actual reports which use these queries and the timings match. – Lukerayner Mar 22 '19 at 18:11
  • What are your server CPU and memory? It seems you have more than 450M records there. I think you can start to think about partitions. – Alex Mar 22 '19 at 18:16
  • Another question: do you really need to get 150K records back through AJAX? Probably you can LIMIT the results? – Alex Mar 22 '19 at 18:27
  • @Alex The table is partitioned on `venue_id` already, and AJAX doesn't actually receive 150k rows of data. I have an outer query, which isn't shown, that reduces that down to around 10 rows. The outer query isn't shown because it isn't affecting the query speeds. – Lukerayner Mar 22 '19 at 18:29
  • It is affecting them. The MySQL server has a built-in query analyzer and optimizer, and if you do some `JOIN`s, for example, it will cut some data. And if we can reduce the result, we need to know what the unnecessary part here is. Probably there is a way to switch the inner and outer query and improve the performance – Alex Mar 22 '19 at 18:33
  • I don't have the code base at my house. No joins are involved, and I know that when I add the outer query the query time isn't changed. The part of the query which needs optimising is the inner query which is shown. All the data that the inner query is collecting (150k rows) is needed. The outer query is basically using that data to `GROUP BY day_epoch` and count the number of rows that have `is_repeat` set to 1 and the number of rows that have it set to 0. – Lukerayner Mar 22 '19 at 18:40
  • Why don't we group it here then? – Alex Mar 22 '19 at 19:09
  • Sorry, I don't think I have explained myself. The query I am running is basically going to give me the total unique new/repeat visitors per day. The reason I originally `GROUP BY day_epoch, device_uuid` is to get all the unique visitors per day. My outer query then uses the results returned and `GROUP BY day_epoch` to count the total number of unique new/repeat visitors there are per day. – Lukerayner Mar 22 '19 at 19:12
  • I would try and start a discussion but I don't have enough rep... – Lukerayner Mar 22 '19 at 19:12
  • To get unique visitors per day should just be something as simple as `SELECT day_epoch, COUNT(DISTINCT IF(is_repeat = 0, device_uuid, NULL)) AS newCount, COUNT(DISTINCT IF(is_repeat != 0, device_uuid, NULL)) AS repeatCount FROM tracking_daily_stats_zone_unique_device_uuids_per_hour WHERE conditions GROUP BY day_epoch` ...assuming I am understanding your is_repeat semantics close to accurately. – Uueerdo Mar 22 '19 at 22:42
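
Formatted and filled in with the WHERE conditions from the question, that suggestion would look roughly like the sketch below (MySQL's IF() takes three arguments; this assumes is_repeat = 0 marks a new visitor):

SELECT day_epoch,
     COUNT(DISTINCT IF(is_repeat = 0, device_uuid, NULL)) AS newCount,
     COUNT(DISTINCT IF(is_repeat != 0, device_uuid, NULL)) AS repeatCount
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
GROUP BY day_epoch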

3 Answers


The straightforward solution is to add this query-specific index to the table:

ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour 
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)

WARNING: Adding this index can take a while on a big table.

And then force its use when you run the query:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

It is definitely not universal but should work for this particular query.

UPDATE: When you have a partitioned table, you can profit by querying a particular PARTITION directly. In our case, since the partitioning is by venue_id, just force it:

SELECT device_uuid,
     day_epoch,
     is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

Where p46 is the string p concatenated with venue_id = 46.

And another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition (provided no other venue_id happens to hash to the same partition).
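
For reference: with PARTITION BY HASH (venue_id) PARTITIONS 100, MySQL assigns a row to partition MOD(venue_id, 100), so venue_id = 46 lands in p46 (and a hypothetical venue_id = 146 would share that same partition). You can confirm which partitions a query actually touches with EXPLAIN PARTITIONS on MySQL 5.6 (on 5.7 and later, plain EXPLAIN already shows a partitions column):

EXPLAIN PARTITIONS
SELECT device_uuid, day_epoch, is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)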

– Alex
  • I am currently adding the index, but I have so much data it's taking a while :) – Lukerayner Mar 22 '19 at 18:30
  • I tried using this index and waited around 5 minutes and got a 500 error. – Lukerayner Mar 22 '19 at 18:54
  • yeah... check your server to see if that query is still executing. And since you said your table is already partitioned, you can simplify this new index to just 2 columns: `ADD INDEX complex_idx (\`day_epoch\`, \`zone_id\`)`. But in that case I think you should not force this index but allow the server optimizer to analyze the query. – Alex Mar 22 '19 at 19:03
  • I just checked and the query which failed is no longer running. So is it now better not to force the index? Do you think it could be a SQL configuration issue? We are using an AWS RDS server with 30GB RAM and 8 vCPUs; I doubt the server specs are the issue. – Lukerayner Mar 22 '19 at 19:08
  • Yes, I agree the server hardware seems fine. Just try one more time to create the index, but the simpler one with no `venue_id` since the table is already partitioned on it. Another question about partitions: do you have a way to detect the partition id from `venue_id`? If yes, we can add partition forcing to the query – Alex Mar 22 '19 at 19:13
  • Ok I will do. I'm not sure how I would get the partition id in all honesty, I have never had to do it before. – Lukerayner Mar 22 '19 at 19:18
  • did you cut some info there `/*!50100 PARTITION BY HASH (venue_id) PARTITIONS 100 */` ? – Alex Mar 22 '19 at 19:19
  • I just copied and pasted the exact output. I'm not even sure what that comment means tbh. – Lukerayner Mar 22 '19 at 19:22
  • I just tried forcing the partition and the query runs, but it doesn't increase the speed unfortunately. I also tried the new index and that crashed as well. – Lukerayner Mar 22 '19 at 19:51
  • When you've forced partition did you try to remove `AND venue_id = 46` from `WHERE`? – Alex Mar 22 '19 at 20:03
  • Oh, I didn't realise that's what you meant, sorry. I just tried it and it is super quick now! When I try a day_epoch range of a month, the query takes 0.0012 seconds to run. Thank you so much. How do you know the partition id? Will it always match the venue_id or was that just a guess? – Lukerayner Mar 22 '19 at 20:35
  • Yes, it matches your `venue_id`, and I think it is in the range 0-99 because you have that in the table definition: `/*!50100 PARTITION BY HASH (venue_id) PARTITIONS 100 */` – Alex Mar 22 '19 at 20:41
  • Yeah, I only created 100 partitions; now I understand what that means. So if I have 81 venues but for some reason the 80th venue has an id of 100, does that mean the 81st venue will have a partition id of 101? – Lukerayner Mar 22 '19 at 20:48
  • I don't have that deep a knowledge when it comes to `PARTITION`s; you can try to ask https://stackoverflow.com/users/1491895/barmar or some other guru – Alex Mar 22 '19 at 20:52
  • Ok thank you for all the time you have spent on this with me. – Lukerayner Mar 22 '19 at 20:59
  • No problem. I really like to solve performance issues. That is always very interesting and I learn something new myself every time – Alex Mar 22 '19 at 21:00
  • Hi mate do you have skype? I just spoke to my boss and he is proper chuffed with the results so he just wanted to say thank you :) I'm not sure if you can send PM on here. – Lukerayner Mar 22 '19 at 21:38
  • skype: dzsysop :-) – Alex Mar 22 '19 at 21:59
  • `PARTITION BY HASH` is virtually useless. It will be especially bad when two `venue_ids` map to the same partition. `PARTITION BY RANGE` _may_ be useful. – Rick James Apr 16 '19 at 22:09
  • @Alex - it is almost never useful to _start_ a composite index with a column being tested with a range (`day_epoch`). – Rick James Apr 16 '19 at 22:52
  • "0.0012 seconds to run" -- that is so fast that it smells fishy. – Rick James Apr 16 '19 at 22:54

What happens if you change the order of conditions? Put venue_id = ? first. The order matters.

Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)

When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index (single or composite) and by making sure that the first selector narrows down the result the most (at least for integers; in the case of strings you need another tactic).


Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources) you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x4 to your visitor. You can sum it up, e.g., hourly or every few minutes, and select from that much smaller table.
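
A minimal sketch of that idea, assuming the report only needs new/repeat counts per venue, zone and day (the summary table name and columns here are hypothetical):

CREATE TABLE tracking_daily_stats_zone_summary (
  day_epoch int(10) NOT NULL,
  venue_id int(5) NOT NULL,
  zone_id int(5) NOT NULL,
  new_count int unsigned NOT NULL,
  repeat_count int unsigned NOT NULL,
  PRIMARY KEY (venue_id, zone_id, day_epoch)
);

-- Refreshed periodically; in practice, restrict the SELECT to recent day_epoch
-- values rather than re-aggregating all 450M rows:
INSERT INTO tracking_daily_stats_zone_summary
SELECT day_epoch, venue_id, zone_id,
       COUNT(DISTINCT IF(is_repeat = 0, device_uuid, NULL)),
       COUNT(DISTINCT IF(is_repeat != 0, device_uuid, NULL))
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
GROUP BY day_epoch, venue_id, zone_id
ON DUPLICATE KEY UPDATE new_count = VALUES(new_count),
                        repeat_count = VALUES(repeat_count);

The report query then becomes a sum over a handful of summary rows instead of a scan over 150,000 detail rows.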


Offtopic: putting an index on everything only slows you down when inserting/updating/deleting. Index the least number of columns, just the ones you actually filter on (e.g. use in a WHERE or GROUP BY).

– Martijn
  • That does make sense. I gave it a go, but unfortunately it hasn't increased the query speed. I just find it odd that adding an extra day to the date range takes it from 0.7 seconds to 13 seconds – Lukerayner Mar 22 '19 at 19:55
  • You mean between the day_epochs? No, that makes sense. It means that the date filter 'returns' a lot to the next condition. It could be that that is the tipping point for your server/resources :) – Martijn Mar 22 '19 at 19:59
  • Yeah, for example when I query between the 5th and 9th it takes 0.7 seconds, but when I query between the 5th and 10th it takes 13 seconds. So just a bit of extra data is having a huge effect. Could this be to do with the MySQL config or something? Just as a side note, each day returns around 20k records, so by adding one more day it isn't like that extra day has 100k+ records, so I wouldn't think the query speed would be affected so much. – Lukerayner Mar 22 '19 at 20:31
  • @Martijn - The Optimizer can and will rearrange the `WHERE` clause items. There is no advantage to manually rearranging them. On the other hand, the order of columns in a composite index can be _very_ important. – Rick James Apr 16 '19 at 22:05
  • @Lukerayner - Quite possibly the Optimizer picked a different index to use, but it hurt more than helped. Provide `EXPLAIN SELECT ...` for the two cases (if possible). – Rick James Apr 16 '19 at 22:07

450M rows is rather large. So, I will discuss a variety of issues that can help.

Shrink data

A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)

  • Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1GB in the data, plus a similar amount in INDEX(hour). (See the sketch after this list.)
  • If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
  • Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
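
A minimal sketch of such a shrink, assuming the ids stay within the smaller ranges (test on a copy first; an ALTER like this rewrites the whole 450M-row table):

ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  MODIFY `hour` tinyint unsigned NOT NULL,       -- 0-23 fits in 1 byte instead of 4
  MODIFY `venue_id` smallint unsigned NOT NULL,  -- 2 bytes, if venue ids stay under 65,536
  MODIFY `zone_id` smallint unsigned NOT NULL;   -- likewise for zone ids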

Optimal INDEX (take #1)

If it is always a single venue_id, then this is a good first cut at the optimal index:

INDEX(venue_id, zone_id, day_epoch)

First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)

Better Primary Key (better index)

With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id). (Here venue_id is presumably included only because partitioning on venue_id requires every unique key, including the PK, to contain it.)

InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:

PRIMARY KEY(venue_id, zone_id, day_epoch,  -- this order, as discussed above;
            id)    -- to make sure that the entire PK is unique.
INDEX(id)      -- to keep AUTO_INCREMENT happy
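
One way to apply that change in a single statement (a sketch; the added INDEX(id) keeps the AUTO_INCREMENT column as the first column of some index, as MySQL requires, and a rebuild at this size will take a long time):

ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (venue_id, zone_id, day_epoch, id),
  ADD INDEX (id);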

And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).

UUID

Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.

Multi-dimensional

Assuming the day_epoch range sometimes spans multiple days, you seem to have 2 or 3 "dimensions":

  • A date range
  • A list of zones
  • A venue.

INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .

There is no good way to get 3 dimensions, so let's focus on 2.

  • You should partition on something that is a "range", such as day_epoch or zone_id.
  • After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".

Plan A: This assumes you are searching for only one venue_id at a time:

PARTITION BY RANGE(day_epoch)  -- see note below

PRIMARY KEY(venue_id, zone_id, day_epoch, id)  -- day_epoch must be included: every unique key must contain the partitioning column

Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:

Well, I don't have good advice here; so let's go with Plan A.

The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE would necessitate BY RANGE(TO_DAYS(...)), which works fine.
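
A sketch of what that could look like (partition names and boundaries are illustrative month markers, and day_epoch must be in the PRIMARY KEY as above; new partitions must be added as time goes on):

ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION BY RANGE (day_epoch) (
  PARTITION p201902 VALUES LESS THAN (1551398400),  -- rows before 2019-03-01
  PARTITION p201903 VALUES LESS THAN (1554076800),  -- rows before 2019-04-01
  PARTITION pfuture VALUES LESS THAN MAXVALUE
);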

You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
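
To see how many partitions you currently have and how the rows are spread across them (TABLE_ROWS is an estimate), you can check information_schema:

SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_NAME = 'tracking_daily_stats_zone_unique_device_uuids_per_hour';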

Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.

Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.

Analysis

Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:

  1. Filter on the date range. This is likely to limit the activity to a single partition.
  2. Drill into the data's BTree for that one partition to find the one venue_id.
  3. Hopscotch through the data from there, landing on the desired zone_ids.
  4. For each, further filter based on the date.
– Rick James