
My InnoDB table holds about 1 billion log entries. I am trying to run a simple SELECT on it that returns about 30 million rows. The query consists of a range condition on an int column holding a Unix timestamp.

Query: SELECT * FROM logs WHERE created_at >= 1446422400 AND created_at <= 1447027199

This query performs poorly, running for about 6-7 hours and retrieving only about 1,000 rows/sec. created_at has an index on it, and the query is using that index. When I run a simple SELECT * FROM logs WHERE id >= xx AND id <= yy, with xx and yy set so that the result set is nearly the same size (30 million rows), performance is great: it takes at most 10 minutes.

This really puzzles me. Why does the range on the primary key work so well while the range on the secondary index performs so badly? How can I optimize the index? I recreated the table a few days ago, so the indexes should be fine.

Some more info:

Table schema:

| logs | CREATE TABLE `logs` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `categoryid` varchar(15) NOT NULL,
  `type` varchar(15) NOT NULL,
  `text` varchar(500) NOT NULL,
  `created_at` int(7) NOT NULL,
  `status` varchar(45) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `status_categoryid_type` (`status`,`categoryid`,`type`),
  KEY `created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=1335078012 DEFAULT CHARSET=latin1 |

Explain:

+------+-------------+-------+-------+---------------+------------+---------+------+----------+-----------------------+
| id   | select_type | table | type  | possible_keys | key        | key_len | ref  | rows     | Extra                 |
+------+-------------+-------+-------+---------------+------------+---------+------+----------+-----------------------+
|    1 | SIMPLE      | logs  | range | created_at    | created_at | 4       | NULL | 31707348 | Using index condition |
+------+-------------+-------+-------+---------------+------------+---------+------+----------+-----------------------+

Goal:

My goal is to retrieve data from the table in weekly chunks. The range above covers exactly one week. Later I want to work through the whole table in batches, always fetching the data of one specific week.
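For example, a single weekly batch would look like this (a minimal sketch; @week_start is just an illustrative session variable that I would advance by 604800 seconds, i.e. 7 days, per batch):

SET @week_start = 1446422400;  -- Mon 2015-11-02 00:00:00 UTC

SELECT *
FROM logs
WHERE created_at >= @week_start
  AND created_at <  @week_start + 604800;  -- exclusive upper bound; same rows as the inclusive range above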

karot
  • Have a read of the following answer, then rethink your one and only clustered primary key: http://stackoverflow.com/questions/4419499/mysql-and-nosql-help-me-to-choose-the-right-one/4421601#4421601 – Jon Black Nov 22 '15 at 17:00
  • In a similar scenario I solved this performance issue by adding additional (redundant) columns to the table. In your case these might be columns like "created_year" and "created_month", populated once by calculating the corresponding year and month values from your created_at column. Using those columns as index columns may significantly increase performance if you adapt your SQL queries to use them as pre-filters (see the sketch after these comments). – maxhb Nov 22 '15 at 17:22
  • Try the index hints trick and share the results with me. – Muhammad Muazzam Nov 22 '15 at 19:12
  • Let's dig deeper -- What are you going to do with the 30M rows? Is it something that can be done in SQL? If so, maybe we can optimize _that_, and not worry about fetching them out of the server. – Rick James Nov 27 '15 at 23:51
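A minimal sketch of maxhb's redundant-column suggestion, adapted to weeks since the goal is weekly chunks (the column and index names here are made up, and the week numbers produced depend on the mode passed to WEEK()):

ALTER TABLE logs
  ADD COLUMN created_year SMALLINT UNSIGNED NOT NULL DEFAULT 0,
  ADD COLUMN created_week TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD KEY created_year_week (created_year, created_week);

-- Backfill once from the existing timestamp column:
UPDATE logs
SET created_year = YEAR(FROM_UNIXTIME(created_at)),
    created_week = WEEK(FROM_UNIXTIME(created_at));

-- A weekly chunk then becomes an equality lookup on the new index:
SELECT * FROM logs WHERE created_year = 2015 AND created_week = 44;  -- example values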

1 Answer


The reason the query by id works so fast is most probably that MySQL has created a clustered index for id, because it is the primary key, while the created_at index is not clustered, because created_at is not the primary key. Each row found through the secondary index therefore needs an extra lookup into the clustered index to fetch the row itself.

I am not sure why the difference is so huge, since the created_at values are most probably nearly as sequential as the ids, but apparently it is.

So, try this:

SELECT id FROM logs WHERE created_at >= 1446422400 ORDER BY id LIMIT 1 

(assign result to id1)

SELECT id FROM logs WHERE created_at <= 1447027199 ORDER BY id DESC LIMIT 1 

(assign result to id2)

SELECT * FROM logs WHERE id >= id1 AND id <= id2
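
Putting it together as runnable MySQL (a minimal sketch using session variables; like the steps above, it assumes that id and created_at increase together):

SELECT id INTO @id1 FROM logs WHERE created_at >= 1446422400 ORDER BY id LIMIT 1;
SELECT id INTO @id2 FROM logs WHERE created_at <= 1447027199 ORDER BY id DESC LIMIT 1;

-- The final range scan now runs on the clustered primary key:
SELECT * FROM logs WHERE id >= @id1 AND id <= @id2;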
Mike Nakis
  • Thanks for your answer. In fact, `created_at` is a little misleading in my case, because it is not only the date when the record was created but also when it was last updated. This causes the data to be "fragmented" rather than contiguous, and because of this your suggestion does not work for me. :( Can I tell MySQL to cluster the key `created_at` too? Or is this a bad approach? – karot Nov 22 '15 at 17:11
  • The way I understand clustered indexes, what is so special about them is that the order of the keys in the index perfectly reflects the order of the rows in the table. As a result, there can only be one clustered index. And this in turn means that there is no solution for you. As a matter of fact, if you have a billion rows, and you are running a query on an arbitrary indexed value, and you are receiving 1000 rows per second, you should consider yourself very lucky. That's the kind of performance you should expect. – Mike Nakis Nov 22 '15 at 17:18
  • Wow, okay. In fact, 1,000 rows/sec is really not enough for my use case. Could partitioning the table give any performance improvement? – karot Nov 22 '15 at 17:20
  • No; table partitioning is only required if the primary key approach also doesn't help. – Muhammad Muazzam Nov 22 '15 at 19:08