
I have a fairly big table of events (5.3 million rows at the moment). I need to traverse this table mostly from beginning to end in a linear fashion, with almost no random seeks. The table currently holds about 5 days of these events.

Due to the size of the table I need to paginate the results, and the internet tells me that "seek pagination" is the best method.

This method works great and fast for traversing the first 3 days, but after that MySQL really begins to slow down. I've figured out it must be something I/O-bound, as my CPU usage actually falls as the slowdown starts.

I believe this has something to do with the 2-column sort I do and the use of filesort; maybe MySQL needs to read all the rows to sort my results, or something like that. Indexing correctly might be the proper fix, but so far I've been unable to find an index that solves my problem.

The complicating part of this database is the fact that the ids and timestamps are NOT perfectly in order. The software requires the data to be ordered by timestamp. However, when adding data to this database, some events are added about 1 minute after they actually happened, so the auto-incremented ids are not in chronological order.

As of now, the slowdown is so bad that my 5-day traversal never finishes. It just gets slower and slower...

I've tried indexing the table in multiple ways, but MySQL does not seem to want to use those indexes, and EXPLAIN keeps showing "filesort". An index is used for the WHERE clause, though.

The workaround I'm currently using is to first do a full table traversal and load all the row ids and timestamps into memory. I sort the rows on the Python side of the software and then load the full data in smaller chunks from MySQL as I traverse (by ids only). This works fine, but is quite inefficient because it traverses the same data twice.
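Roughly, the workaround looks like this. This is only a sketch: it uses an in-memory SQLite table as a stand-in for the real MySQL `events` table, with the schema trimmed down to the columns that matter here.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL `events` table (trimmed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, timestamp INTEGER, data TEXT)")
# ids are not in chronological order: id 2 has an earlier timestamp than id 1.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, 105, "a"), (2, 100, "b"), (3, 110, "c")])

# Pass 1: load only the (timestamp, id) pairs and sort them in Python.
keys = sorted(conn.execute("SELECT timestamp, id FROM events"))

# Pass 2: fetch the full rows in chunks, by id, in timestamp order.
CHUNK = 2
rows = []
for i in range(0, len(keys), CHUNK):
    ids = [k[1] for k in keys[i:i + CHUNK]]
    placeholders = ",".join("?" * len(ids))
    chunk = conn.execute(
        f"SELECT id, timestamp, data FROM events WHERE id IN ({placeholders})",
        ids).fetchall()
    # Re-order the chunk to match the sorted key order.
    by_id = {r[0]: r for r in chunk}
    rows.extend(by_id[row_id] for row_id in ids)

print([r[0] for r in rows])  # ids in timestamp order: [2, 1, 3]
```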

The schema of the table:

CREATE TABLE `events` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `server` varchar(45) DEFAULT NULL,
  `software` varchar(45) DEFAULT NULL,
  `timestamp` bigint(20) DEFAULT NULL,
  `data` text,
  `event_type` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `index3` (`timestamp`,`server`,`software`,`id`),
  KEY `index_ts` (`timestamp`)
) ENGINE=InnoDB AUTO_INCREMENT=7410472 DEFAULT CHARSET=latin1;

The query (one possible line):

SELECT software,
       server,
       timestamp,
       id,
       event_type,
       data
FROM   events
WHERE  ( server = 'a58b'
         AND ( software IS NULL
                OR software IN ( 'ASD', 'WASD' ) ) )
       AND ( timestamp, id ) > ( 100, 100 )
       AND timestamp <= 200
ORDER  BY timestamp ASC,
          id ASC
LIMIT  100; 

The query is based on https://blog.jooq.org/2013/10/26/faster-sql-paging-with-jooq-using-the-seek-method/ (and some other postings with the same idea). I believe it is called "seek pagination with a seek predicate". The basic gist is that I have a starting timestamp and an ending timestamp, and I need to get all the events with the software on the servers I've specified OR only the server-specific events (software = NULL). The weird-looking parentheses are due to Python constructing the queries based on the parameters it is given. I left them visible on the small chance they might have some effect.

I'm expecting the traversal to finish before the heat death of the universe.

Madhur Bhaiya
Migzu
  • I have never seen this construct in MySQL: `(timestamp,id) > (100,100)`. I tested it, and it does something, but I cannot figure out what, nor can I find any documentation for it. Can anybody enlighten me? – KIKO Software Jul 14 '19 at 10:33
  • @KIKOSoftware It is called [Row Subqueries](https://dev.mysql.com/doc/refman/8.0/en/row-subqueries.html) or Tuple comparison. Check [this](https://stackoverflow.com/questions/5506729/sql-tuple-comparison) and [this](https://stackoverflow.com/questions/40489417/using-tuple-comparison-in-mysql-is-it-efficient/43457966) – Madhur Bhaiya Jul 14 '19 at 10:41
  • @MadhurBhaiya Thanks, yes, that seems to be it. Without knowing what it is called it is difficult to find it. I found that `(a, b) > (x, y)` is equivalent to `(a > x) OR ((a = x) AND (b > y))`. No wonder I couldn't figure it out by testing it. – KIKO Software Jul 14 '19 at 10:46
  • What is the cardinality of `server` and `software` (run: `select count(distinct server), count(distinct software) from events`) – Paul Spiegel Jul 14 '19 at 11:20
  • Please post TEXT results of A) SHOW INDEX FROM events; and B) EXPLAIN SELECT SQL_NO_CACHE software, server, timestamp, ...... and we will have the cardinality of each index and some idea what the OPTIMIZER might do with your query. – Wilson Hauck Jul 14 '19 at 12:04
  • Is there any reason you chose to define timestamp as BIGINT(20) rather than simply DATETIME? With a DATETIME data time, you could use WHERE timestamp BETWEEN $beg_timestamp AND $end_timestamp for simplicity. You should probably use a column name such as ev_datetime to avoid RESERVED WORD difficulties. – Wilson Hauck Jul 14 '19 at 12:23
  • @KIKOSoftware - The alternative (possibly better) reformulation is `a >= x AND (a > x OR b > y)` – Rick James Jul 14 '19 at 19:49
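The equivalences discussed in these comments can be checked mechanically in Python, which compares tuples lexicographically the same way MySQL's row constructor does. A small sketch (not MySQL-specific):

```python
from itertools import product

# Three forms of the seek predicate discussed above:
def tuple_cmp(a, b, x, y):          # row constructor: (a, b) > (x, y)
    return (a, b) > (x, y)

def and_or(a, b, x, y):             # (a > x) OR ((a = x) AND (b > y))
    return a > x or (a == x and b > y)

def ge_form(a, b, x, y):            # a >= x AND (a > x OR b > y)
    return a >= x and (a > x or b > y)

# Exhaustively verify all three agree on a small domain.
for a, b, x, y in product(range(3), repeat=4):
    assert tuple_cmp(a, b, x, y) == and_or(a, b, x, y) == ge_form(a, b, x, y)
print("all three forms are equivalent")
```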

2 Answers


First change

AND ( timestamp, id ) > ( 100, 100 )

to

AND (timestamp > 100 OR timestamp = 100 AND id > 100)

This optimisation is suggested in the official documentation: Row Constructor Expression Optimization

Now the engine will be able to use the index on (timestamp). Depending on the cardinality of the columns server and software, that might already be fast enough.

An index on (server, timestamp, id) should improve the performance further.

If that is still not fast enough, I would suggest a UNION optimization for

AND (software IS NULL OR software IN ('ASD', 'WASD'))

That would be:

(
    SELECT software, server, timestamp, id, event_type, data
    FROM events
    WHERE server = 'a58b'
      AND software IS NULL
      AND (timestamp > 100 OR timestamp = 100 AND id > 100)
      AND timestamp <= 200
    ORDER BY timestamp ASC, id ASC
    LIMIT 100
) UNION ALL (
    SELECT software, server, timestamp, id, event_type, data
    FROM events
    WHERE server = 'a58b'
      AND software = 'ASD'
      AND (timestamp > 100 OR timestamp = 100 AND id > 100)
      AND timestamp <= 200
    ORDER BY timestamp ASC, id ASC
    LIMIT 100
) UNION ALL (
    SELECT software, server, timestamp, id, event_type, data
    FROM events
    WHERE server = 'a58b'
      AND software = 'WASD'
      AND (timestamp > 100 OR timestamp = 100 AND id > 100)
      AND timestamp <= 200
    ORDER BY timestamp ASC, id ASC
    LIMIT 100
)
ORDER BY timestamp ASC, id ASC
LIMIT 100

You will need to create an index on (server, software, timestamp, id) for this query.

Paul Spiegel
  • According to me, this query will not work (give desired results) for paginated queries LIMIT(offset, count). – Rajat Goel Jul 14 '19 at 12:56
  • @RajatGoel This query will return the same result as the original one, which also doesn't use an offset. The main point is **not to use an OFFSET**, because it's inefficient for high offsets. – Paul Spiegel Jul 14 '19 at 13:11
  • As mentioned in the question, he is using paginated queries. And for pagination, offset is needed. – Rajat Goel Jul 14 '19 at 13:13
  • @RajatGoel There are different ways to implement pagination. Many of them avoid the use of `OFFSET` (mainly to improve the performance - but not only). Here is a [good example](https://mariadb.com/kb/en/library/pagination-optimization/). However - This discussion is pointless, since I'm not changing the result set. If you think the original query doesn't do what OP wants, then you can write a comment to the question. – Paul Spiegel Jul 14 '19 at 13:54
  • The manual page you linked to says "rewriting the row constructor expression using an equivalent nonconstructor expression may result in more complete index use" -- This has always been true. I just checked 8.0.15 and found that use of (a,b) > (3,4) to still be poorly optimized. True, the construct works. But it is not as fast as the equivalent combo of an `AND` and an `OR`. – Rick James Jul 14 '19 at 19:32
  • This union optimization seems to reduce query time by 90%, and I haven't even created the index yet. This is great! Thank you so much! – Migzu Jul 15 '19 at 23:23

There are multiple complications going on.

The quick fix is

INDEX(server, timestamp, id)   -- in this order

together with

    WHERE  server = 'a58b'
      AND  timestamp BETWEEN 100 AND 200
      AND ( software IS NULL
                OR software IN ( 'ASD', 'WASD' ) )
      AND ( timestamp, id ) > ( 100, 100 )
    ORDER  BY timestamp ASC,
              id ASC
    LIMIT  100; 

Note that server needs to be first in the index, not after the thing you are doing a range on (timestamp). Also, I broke out timestamp BETWEEN ... to make it clear to the optimizer that the next column of the ORDER BY might make use of the index.

You said "pagination", so I assume you have an OFFSET, too? Add it back in so we can discuss the implications. My blog on "remembering where you left off" instead of using OFFSET may (or may not) be practical.
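To illustrate what "remembering where you left off" means here: each page's last (timestamp, id) pair becomes the seek cursor for the next page, so no OFFSET is ever needed. A minimal sketch, using SQLite in place of MySQL with a trimmed-down table:

```python
import sqlite3

# In-memory SQLite stand-in with ids deliberately out of timestamp order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, timestamp INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, 100 + (i % 7)) for i in range(1, 21)])

PAGE = 5
last_ts, last_id = -1, -1   # seek cursor: last row of the previous page
seen = []
while True:
    page = conn.execute(
        """SELECT timestamp, id FROM events
           WHERE (timestamp > ? OR (timestamp = ? AND id > ?))
           ORDER BY timestamp, id
           LIMIT ?""",
        (last_ts, last_ts, last_id, PAGE)).fetchall()
    if not page:
        break
    seen.extend(page)
    last_ts, last_id = page[-1]   # remember where we left off

assert seen == sorted(seen)   # rows arrive in (timestamp, id) order
assert len(seen) == 20        # every row visited exactly once
```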

Rick James
  • I've actually read your blog on pagination before. The pagination part of my query is "AND ( timestamp, id ) > ( 100, 100 )". I just up the timestamp and ID based on the last result of the last query. – Migzu Jul 14 '19 at 19:56
  • Use of row constructors should be changed to the AND-OR equivalent until (if) it gets optimized. – Rick James Jul 15 '19 at 04:44