
I'm having a problem with SQL and slow response times.

It's not because I'm missing indexes, but because the results are HUGE. I'm using these queries to feed an API response that is delivered to an external client. To keep the responses manageable, I'm using pagination. The problem is that pages deep in the result set slow down to the point where a single query takes 800 seconds to complete, which means the web service hangs around that long waiting to deliver the response.

mysql> EXPLAIN SELECT * FROM externallinks_global LEFT JOIN externallinks_paywall ON externallinks_global.paywall_id=externallinks_paywall.paywall_id WHERE `has_archive` = 1 LIMIT 1385000,1001;
+------+-------------+-----------------------+--------+------------------------------------------------------------------+------------+---------+--------------------------------------------------+---------+-------+
| id   | select_type | table                 | type   | possible_keys                                                    | key        | key_len | ref                                              | rows    | Extra |
+------+-------------+-----------------------+--------+------------------------------------------------------------------+------------+---------+--------------------------------------------------+---------+-------+
|    1 | SIMPLE      | externallinks_global  | ref    | HASARCHIVE,APIINDEX7,APIINDEX10,APIINDEX11,APIINDEX12,APIINDEX13 | APIINDEX13 | 1       | const                                            | 4291236 |       |
|    1 | SIMPLE      | externallinks_paywall | eq_ref | PRIMARY                                                          | PRIMARY    | 4       | s51059__cyberbot.externallinks_global.paywall_id |       1 |       |
+------+-------------+-----------------------+--------+------------------------------------------------------------------+------------+---------+--------------------------------------------------+---------+-------+
2 rows in set (0.01 sec)

The above is an EXPLAIN of the problem query, which took 800 seconds to execute. As you can see, the tables are pretty well indexed. My question is: how can I get the results almost instantly when fetching a chunk from deep within the large result set? Is there some way to do this?

Here is the table the query runs against, along with the table being joined to it:

CREATE TABLE IF NOT EXISTS `externallinks_global` (
                              `url_id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
                              `paywall_id` INT UNSIGNED NOT NULL,
                              `url` VARCHAR(767) NOT NULL,
                              `archive_url` BLOB NULL,
                              `has_archive` TINYINT UNSIGNED NOT NULL DEFAULT '0',
                              `live_state` TINYINT UNSIGNED NOT NULL DEFAULT '4',
                              `last_deadCheck` TIMESTAMP NOT NULL DEFAULT '0000-00-00 00:00:00',
                              `archivable` TINYINT UNSIGNED NOT NULL DEFAULT '1',
                              `archived` TINYINT UNSIGNED NOT NULL DEFAULT '2',
                              `archive_failure` BLOB NULL DEFAULT NULL,
                              `access_time` TIMESTAMP NOT NULL,
                              `archive_time` TIMESTAMP NULL DEFAULT NULL,
                              `reviewed` TINYINT UNSIGNED NOT NULL DEFAULT '0',
                              PRIMARY KEY (`url_id` ASC),
                              UNIQUE INDEX `url_UNIQUE` (`url` ASC),
                              INDEX `LIVE_STATE` (`live_state` ASC),
                              INDEX `LAST_DEADCHECK` (`last_deadCheck` ASC),
                              INDEX `PAYWALLID` (`paywall_id` ASC),
                              INDEX `REVIEWED` (`reviewed` ASC),
                              INDEX `HASARCHIVE` (`has_archive` ASC),
                              INDEX `ISARCHIVED` (`archived` ASC),
                              INDEX `APIINDEX1` (`live_state` ASC, `paywall_id` ASC),
                              INDEX `APIINDEX2` (`live_state` ASC, `paywall_id` ASC, `archived` ASC),
                              INDEX `APIINDEX3` (`live_state` ASC, `paywall_id` ASC, `reviewed` ASC),
                              INDEX `APIINDEX4` (`live_state` ASC, `archived` ASC),
                              INDEX `APIINDEX5` (`live_state` ASC, `reviewed` ASC),
                              INDEX `APIINDEX6` (`archived` ASC, `reviewed` ASC),
                              INDEX `APIINDEX7` (`has_archive` ASC, `paywall_id` ASC),
                              INDEX `APIINDEX8` (`paywall_id` ASC, `archived` ASC),
                              INDEX `APIINDEX9` (`paywall_id` ASC, `reviewed` ASC),
                              INDEX `APIINDEX10` (`has_archive` ASC, `live_state` ASC, `paywall_id` ASC, `archived` ASC, `reviewed` ASC),
                              INDEX `APIINDEX11` (`has_archive` ASC, `archived` ASC, `reviewed` ASC),
                              INDEX `APIINDEX12` (`has_archive` ASC, `live_state` ASC, `paywall_id` ASC),
                              INDEX `APIINDEX13` (`has_archive` ASC, `live_state` ASC),
                              INDEX `APIINDEX14` (`has_archive` ASC, `live_state` ASC, `paywall_id` ASC, `reviewed` ASC),
                              INDEX `APIINDEX15` (`has_archive` ASC, `live_state` ASC, `reviewed` ASC),
                              INDEX `APIINDEX16` (`has_archive` ASC, `paywall_id` ASC, `reviewed` ASC));

and

CREATE TABLE IF NOT EXISTS `externallinks_paywall` (
                              `paywall_id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
                              `domain` VARCHAR(255) NOT NULL,
                              `paywall_status` TINYINT UNSIGNED NOT NULL DEFAULT 0,
                              PRIMARY KEY (`paywall_id` ASC),
                              UNIQUE INDEX `domain_UNIQUE` (`domain` ASC),
                              INDEX `PAYWALLSTATUS` (`paywall_status` ASC));

The global table has roughly 27 million rows, and the paywall table has about 16 million.

Cyberpower678
  • There may (or may not) be a way to improve your query, your data/data preparation or your setup, but all we know are the names of your indexes. That's like "I have a car. It is built from a lot of parts; I named one of these 'api13' (not sure if it is actually a useful part though). The car makes strange noises and doesn't drive well. What should I do?" Without a lot more details, we can only give you very general tips, so try e.g. [MySQL Data - Best way to implement paging?](https://stackoverflow.com/questions/3799193/mysql-data-best-way-to-implement-paging) and all the links there. – Solarflare Jun 22 '17 at 11:34
  • Thank you for your comment. What additional data are you looking for? The table itself? Table size? – Cyberpower678 Jun 22 '17 at 12:18
  • @Solarflare I have added more details. If you need more details, please let me know. – Cyberpower678 Jun 22 '17 at 13:40
  • In addition to the table details: the important part is the query itself (you can simplify it and e.g. remove private information if it doesn't alter how the query works (join and where and order)). Also some details about your data/result in general (e.g.: "a usual query will find 4% of the 27 million rows, and each row of table 1 is joined to about 5 rows from table 2"), and e.g. if your data is static, if data is only appended or can change everywhere. Maybe some example data to help to understand your query if it is complicated. Other questions may arise after we see your query. – Solarflare Jun 22 '17 at 20:07
  • @Solarflare Sorry for the delay in responding. I've been busy. The query I'm using is the one being explained above, with the LIMIT. I'm not sure if that is what you're looking for. As seen in the explain, it has 4291236 rows of the 27 million. This is a many-to-one table. So the joins have multiple rows on the left, matching a single row on the right. If you need more, let me know. My SQL skills are intermediate at best, so I am learning a lot here. – Cyberpower678 Jun 27 '17 at 15:51
  • Sorry, I totally missed the query there, that was what I wanted to know. Wrote you an answer. – Solarflare Jun 30 '17 at 10:59

1 Answer


If you use an offset of 1,000,000, MySQL has to internally generate and then throw away the first 1 million rows of your result just to send you 1,000 rows. And for the next page with an offset of 1,001,000, it has to generate those 1 million (plus 1,000) rows again. This obviously takes more and more time, and by page 1000, MySQL will have read the rows for page 1 a thousand times and discarded them 999 times.
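To put numbers on that, with the query from the question only the offset changes from one page to the next:

 -- page at offset 1385000: 1,386,001 joined rows are built, the first 1,385,000 are thrown away
 ... WHERE `has_archive` = 1 LIMIT 1385000,1001;
 -- the next page builds and discards those ~1.386 million rows all over again
 ... WHERE `has_archive` = 1 LIMIT 1386001,1001;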

The first thing you should make sure of is that you are using "server-side paging" instead of "client-side paging". This is a connection setting in your client's environment (in this case, your API). While the name also includes the word "page" (because it is a similar concept), it is not only used for your kind of paging, and you should usually enable it (oftentimes it is enabled by default).

This might already solve your problem: your API sends a request for the whole result set to the server and then fetches it row by row, passing each row on to the client. The caching is done on the server, so you do not need to retrieve the whole result set at once ("client-side paging") or hold more than one row at a time in the API.

If that is not an option for you (check it twice or ten times before dismissing it just because it doesn't work on the first try), you can optimize your query. Since you retrieve the pages one after the other and every row in your result has a unique url_id, you can use the last url_id you read to really skip all previous rows, instead of reading and discarding them. This requires just a single index lookup, and every page will take about the same amount of time. So try:

 SELECT * FROM externallinks_global 
 LEFT JOIN externallinks_paywall 
 ON externallinks_global.paywall_id=externallinks_paywall.paywall_id 
 WHERE url_id > :LAST_READ_ID and has_archive = 1 
 ORDER BY url_id
 LIMIT 1000;

Replace :LAST_READ_ID with the last url_id of your previous page. You have to manage that value in the API, just as you currently store the offset.
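For example, with made-up url_id values just to show how one page hands the cursor to the next:

 -- first page: nothing has been read yet, so start after url_id 0
 SELECT * FROM externallinks_global
 LEFT JOIN externallinks_paywall
 ON externallinks_global.paywall_id=externallinks_paywall.paywall_id
 WHERE url_id > 0 and has_archive = 1
 ORDER BY url_id
 LIMIT 1000;

 -- suppose the last row of that page had url_id 52340 (a made-up value);
 -- the API stores it, and the next page continues right after it
 SELECT * FROM externallinks_global
 LEFT JOIN externallinks_paywall
 ON externallinks_global.paywall_id=externallinks_paywall.paywall_id
 WHERE url_id > 52340 and has_archive = 1
 ORDER BY url_id
 LIMIT 1000;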

Please note the ORDER BY. Omitting it can give you inconsistent and unexpected results: if you don't set an order, MySQL is free to pick a different one for the next page, and you might get some rows several times and others not at all.

You have to consider some things:

  • you cannot jump to any random page (since you need to know the last received url_id), but since you are getting your data in order, this is no problem
  • if the data changes while you are reading it (e.g. if you don't use a transaction; see the sketch after this list), you will get the updated data for the yet-unread rows. It is still a lot safer than your current method though: if you were to change e.g. has_archive of the first row from 0 to 1 after you had read past it, with the offset method that row would now be included in the result set, every later offset would shift by one, and you would get some row twice.
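The transaction mentioned in the second point only helps if your API can keep a single connection open while it pages through the whole result, which a stateless web API usually cannot do. If it can, a minimal sketch (assuming InnoDB with its default REPEATABLE READ isolation) would be:

 START TRANSACTION WITH CONSISTENT SNAPSHOT;
 -- run the paged SELECTs from above here, one per page, all on this
 -- connection, so that every page sees the same snapshot of the data
 COMMIT;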

You should test whether using the primary key or has_archive, url_id will be faster. It will depend on the percentage of rows that have has_archive = 1; at around 20%, the primary key will probably beat the other index. Test it by forcing the index, e.g.

... FROM externallinks_global force index (primary) left join ...

or

... FROM externallinks_global force index (hasarchive) left join ...

If you test without using force, it will probably use the index hasarchive.
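Written out in full against the keyset query from above, the two candidates to benchmark would look like this (the hint only pins the access path on externallinks_global):

 SELECT * FROM externallinks_global FORCE INDEX (PRIMARY)
 LEFT JOIN externallinks_paywall
 ON externallinks_global.paywall_id=externallinks_paywall.paywall_id
 WHERE url_id > :LAST_READ_ID and has_archive = 1
 ORDER BY url_id
 LIMIT 1000;

and

 SELECT * FROM externallinks_global FORCE INDEX (HASARCHIVE)
 LEFT JOIN externallinks_paywall
 ON externallinks_global.paywall_id=externallinks_paywall.paywall_id
 WHERE url_id > :LAST_READ_ID and has_archive = 1
 ORDER BY url_id
 LIMIT 1000;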

Solarflare
  • Sorry for the very late response. I implemented your suggestions, by redoing the queries and how the API paginates, and WOW what a difference. Thanks so much for your help. – Cyberpower678 Jul 19 '17 at 22:24