9

I have a database with 60 million entries.

Every entry contains:

  • ID
  • DataSourceID
  • Some Data
  • DateTime

  1. I need to select entries from a certain month. Each month contains approximately 2 million entries.

     select * 
       from Entries 
      where time between "2010-04-01 00:00:00" and "2010-05-01 00:00:00"
    

    (query takes approximately 1.5 minutes)

  2. I'd also like to select data from a certain month for a given DataSourceID. (This takes approximately 20 seconds.)

There are about 50-100 different DataSourceIDs.

Is there a way to make this faster? What are my options? How can I optimize this database/query?


EDIT: There are approx. 60-100 inserts PER second!

JBeurer

4 Answers

7

To get entries in a particular month of a particular year faster, you will need to index the time column:

CREATE INDEX idx_time ON ENTRIES(time) USING BTREE;

Additionally, use:

SELECT e.* 
  FROM ENTRIES e
 WHERE e.time BETWEEN '2010-04-01' AND DATE_SUB('2010-05-01', INTERVAL 1 SECOND)

...because BETWEEN is inclusive, so you'd get anything dated "2010-05-01 00:00:00" with the query you posted.
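An equivalent alternative, assuming time is a DATETIME column, is a half-open range, which sidesteps the DATE_SUB arithmetic entirely:

SELECT e.*
  FROM ENTRIES e
 WHERE e.time >= '2010-04-01'
   AND e.time <  '2010-05-01'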

I'd also like to select data from certain month from a given DataSourceID

You can either add a separate index for the datasourceid column:

CREATE INDEX idx_datasourceid ON ENTRIES(datasourceid) USING BTREE;

...or set up a composite index that includes both columns:

CREATE INDEX idx_time_datasourceid ON ENTRIES(time, datasourceid) USING BTREE;

A composite index can only be used when the query filters on its leftmost column(s). In this example, having time first will work for both situations you mentioned -- datasourceid doesn't have to be present for the index to be of use. But you have to test your queries by viewing the EXPLAIN output to really know what works best for your data and the queries being performed on that data.
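For example, here's a sketch of how to check (the datasourceid value 42 is just illustrative):

EXPLAIN
SELECT e.*
  FROM ENTRIES e
 WHERE e.time BETWEEN '2010-04-01' AND DATE_SUB('2010-05-01', INTERVAL 1 SECOND)
   AND e.datasourceid = 42;

-- The "key" column shows which index was chosen (e.g. idx_time_datasourceid),
-- "rows" is the estimated number of rows examined, and a NULL key means the
-- optimizer decided on a full table scan instead.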

That said, indexes will slow down INSERT, UPDATE and DELETE statements. And an index doesn't provide a lot of value if the column has few distinct values - e.g. a boolean column is a bad choice to index, because its cardinality is low.

OMG Ponies
  • A query that asks for 2 million rows typically prefers a table scan over an index range seek, so I don't think MySQL would use the index `(time)`. Two million table lookups are way too expensive. The index would have to be covering. – Andomar Mar 27 '11 at 18:19
  • @Andomar: True, but I like to give the options & tools to review so the people learn. We've seen people hide details too. – OMG Ponies Mar 27 '11 at 18:23
  • Will adding index on time, datasourceid significantly slow down adding to the ENTRIES table? Because that happens quite frequently. – JBeurer Mar 27 '11 at 18:26
  • @JBeurer: As stated in the last paragraph, yes. Everything will have tradeoffs, you have to decide what matters more and possibly restructure to support both better. – OMG Ponies Mar 27 '11 at 18:27
  • Wow. Adding index on time brought down the query time from 1.5 mins to 3 seconds. Now if only this doesn't slow down inserts significantly. – JBeurer Mar 27 '11 at 19:03
6

Take advantage of InnoDB clustered primary key indexes.

http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html

This will be extremely performant:

create table datasources
(
year_id smallint unsigned not null,
month_id tinyint unsigned not null,
datasource_id tinyint unsigned not null,
id int unsigned not null, -- needed for uniqueness
data int unsigned not null default 0,
primary key (year_id, month_id, datasource_id, id)
)
engine=innodb;

select * from datasources where year_id = 2011 and month_id between 1 and 3;

select * from datasources where year_id = 2011 and month_id = 4 and datasource_id = 100;

-- etc..
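Writes into this schema would derive year_id and month_id from each entry's timestamp at insert time. A minimal sketch (the values below are made up):

-- year_id and month_id come from the entry's datetime; the other values are illustrative
insert into datasources (year_id, month_id, datasource_id, id, data)
values (year('2010-04-17 12:34:56'), month('2010-04-17 12:34:56'), 100, 3290331, 42);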

EDIT 2

Forgot I was running the first test script with 3 months of data. Here are the results for a single month: 0.34 and 0.69 seconds.

select d.* from datasources d where d.year_id = 2010 and d.month_id = 3 and datasource_id = 100 order by d.id desc limit 10;
+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           100 | 3290330 | 38434 |
|    2010 |        3 |           100 | 3290329 |  9988 |
|    2010 |        3 |           100 | 3290328 | 25680 |
|    2010 |        3 |           100 | 3290327 | 17627 |
|    2010 |        3 |           100 | 3290326 | 64508 |
|    2010 |        3 |           100 | 3290325 | 14257 |
|    2010 |        3 |           100 | 3290324 | 45950 |
|    2010 |        3 |           100 | 3290323 | 49986 |
|    2010 |        3 |           100 | 3290322 |  2459 |
|    2010 |        3 |           100 | 3290321 | 52971 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.34 sec)

select d.* from datasources d where d.year_id = 2010 and d.month_id = 3 order by d.id desc limit 10;
+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           116 | 3450346 | 42455 |
|    2010 |        3 |           116 | 3450345 | 64039 |
|    2010 |        3 |           116 | 3450344 | 27046 |
|    2010 |        3 |           116 | 3450343 | 23730 |
|    2010 |        3 |           116 | 3450342 | 52380 |
|    2010 |        3 |           116 | 3450341 | 35700 |
|    2010 |        3 |           116 | 3450340 | 20195 |
|    2010 |        3 |           116 | 3450339 | 21758 |
|    2010 |        3 |           116 | 3450338 | 51378 |
|    2010 |        3 |           116 | 3450337 | 34687 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.69 sec)

EDIT 1

Decided to test the above schema with approx. 60 million rows spread over 3 years. Each query is run cold, i.e. each one is run separately, after which MySQL is restarted to clear any buffers, and with no query caching.

The full test script can be found here: http://pastie.org/1723506 or below...

As you can see it's a pretty performant schema even on my humble desktop :)

select count(*) from datasources;
+----------+
| count(*) |
+----------+
| 60306030 |
+----------+

select count(*) from datasources where year_id = 2010;
+----------+
| count(*) |
+----------+
| 16691669 |
+----------+

select
 year_id, month_id, count(*) as counter
from
 datasources
where 
 year_id = 2010
group by
 year_id, month_id;
+---------+----------+---------+
| year_id | month_id | counter |
+---------+----------+---------+
|    2010 |        1 | 1080108 |
|    2010 |        2 | 1210121 |
|    2010 |        3 | 1160116 |
|    2010 |        4 | 1300130 |
|    2010 |        5 | 1860186 |
|    2010 |        6 | 1220122 |
|    2010 |        7 | 1250125 |
|    2010 |        8 | 1460146 |
|    2010 |        9 | 1730173 |
|    2010 |       10 | 1490149 |
|    2010 |       11 | 1570157 |
|    2010 |       12 | 1360136 |
+---------+----------+---------+
12 rows in set (5.92 sec)


select 
 count(*) as counter
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100;

+---------+
| counter |
+---------+
|   30003 |
+---------+
1 row in set (1.04 sec)

explain
select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100
order by
 d.id desc limit 10;

+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows    | Extra                       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
|  1 | SIMPLE      | d     | range | PRIMARY       | PRIMARY | 4       | NULL | 4451372 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
1 row in set (0.00 sec)


select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3 and datasource_id = 100
order by
 d.id desc limit 10;

+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           100 | 3290330 | 38434 |
|    2010 |        3 |           100 | 3290329 |  9988 |
|    2010 |        3 |           100 | 3290328 | 25680 |
|    2010 |        3 |           100 | 3290327 | 17627 |
|    2010 |        3 |           100 | 3290326 | 64508 |
|    2010 |        3 |           100 | 3290325 | 14257 |
|    2010 |        3 |           100 | 3290324 | 45950 |
|    2010 |        3 |           100 | 3290323 | 49986 |
|    2010 |        3 |           100 | 3290322 |  2459 |
|    2010 |        3 |           100 | 3290321 | 52971 |
+---------+----------+---------------+---------+-------+
10 rows in set (0.98 sec)


select 
 count(*) as counter
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3;

+---------+
| counter |
+---------+
| 3450345 |
+---------+
1 row in set (1.64 sec)

explain
select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3
order by
 d.id desc limit 10;

+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows    | Extra                       |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
|  1 | SIMPLE      | d     | range | PRIMARY       | PRIMARY | 3       | NULL | 6566916 | Using where; Using filesort |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+-----------------------------+
1 row in set (0.00 sec)


select 
 d.* 
from 
 datasources d
where 
 d.year_id = 2010 and d.month_id between 1 and 3
order by
 d.id desc limit 10;

+---------+----------+---------------+---------+-------+
| year_id | month_id | datasource_id | id      | data  |
+---------+----------+---------------+---------+-------+
|    2010 |        3 |           116 | 3450346 | 42455 |
|    2010 |        3 |           116 | 3450345 | 64039 |
|    2010 |        3 |           116 | 3450344 | 27046 |
|    2010 |        3 |           116 | 3450343 | 23730 |
|    2010 |        3 |           116 | 3450342 | 52380 |
|    2010 |        3 |           116 | 3450341 | 35700 |
|    2010 |        3 |           116 | 3450340 | 20195 |
|    2010 |        3 |           116 | 3450339 | 21758 |
|    2010 |        3 |           116 | 3450338 | 51378 |
|    2010 |        3 |           116 | 3450337 | 34687 |
+---------+----------+---------------+---------+-------+
10 rows in set (1.98 sec)

Hope this helps :)

Jon Black
  • How will this affect the performance of inserts? Because with a database of 60 million entries and 100 inserts per second, that could become an issue, no? – JBeurer Mar 27 '11 at 18:48
  • inserts shouldn't be a problem as you're probably inserting in clustered PK order, which is performant even for InnoDB :P – Jon Black Mar 27 '11 at 18:49
  • If rows aren't updated or deleted, would the data naturally get clustered around his "time" column as a side effect of inserting the rows in order "as time goes by", if you know what I mean? The underlying mechanism for allocating blocks will determine this, but I don't know how MySQL does it. Do you know how it works? What I am getting at is that using an InnoDB table with an auto-incrementing primary key might provide the same clustering benefit but without the fat key. – Ronnis Mar 27 '11 at 19:40
  • if he's using MyISAM and inserting in time sequence then new data is appended to the end of the file. However, MyISAM doesn't support clustered primary keys, which means lots more I/O than the equivalent InnoDB schema. – Jon Black Mar 27 '11 at 20:28
  • 3 secs vs. 0.36 - no contest imo – Jon Black Mar 27 '11 at 22:24
  • @f00, I meant to use InnoDB clustered on the ID. Higher values for ID should have higher values for TIME, producing pretty much the same physical order of rows regardless of whether the table is clustered by ID or TIME. I will experiment with it later today. Anyway, +1 for providing examples with actual measures! – Ronnis Mar 28 '11 at 08:38
  • Clustering alters the data block into a certain distinct order to match the index, resulting in the row data being stored in order. Therefore, only one clustered index can be created on a given database table. Clustered indices can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items is selected. You don't seem to select by ID but by range, i.e. where month between 1 and 3 and datasource_id in (10,20,30) – Jon Black Mar 28 '11 at 08:50
2

You could use an index to trade disk usage for query speed. An index that starts with the time column can speed up queries that ask for a particular month:

create index IX_YourTable_Date on YourTable (time, DataSourceID, ID, SomeData)

Because the index starts with the time field, MySQL can do a key range scan on the index. That should be as fast as it gets. The index should include all columns in the query, or MySQL would have to look up the table data for each row it finds in the index. Since you're asking for 2 million rows, MySQL will likely ignore an index that is not covering. (Covering index = an index that includes all columns used in the query.)
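A quick way to confirm that the index is covering is to run EXPLAIN and look for "Using index" in the Extra column (a sketch, assuming the index above has been created):

explain
select time, DataSourceID, ID, SomeData
  from YourTable
 where time between '2010-04-01' and '2010-04-30 23:59:59';

-- "Using index" in Extra means the result is read from the index alone,
-- without touching the table rows.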

If you never query on ID, you can redefine the table to use (time, DataSourceID, ID) as primary key:

alter table YourTable drop primary key, -- assumes ID is currently the primary key
                      add primary key (time, DataSourceID, ID)

This will speed up searches on time at no cost in disk space, but searches on ID will be very slow.

Andomar
1

I would try putting an index on the time field, if you haven't already.

For DataSourceID, you could try using ENUM instead of VARCHAR/INT.
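A rough sketch of what that could look like (the source names are hypothetical; ENUM values are stored internally as small integers, so comparisons and storage are cheap):

-- hypothetical source names; list the real DataSourceID values here
ALTER TABLE Entries
  MODIFY DataSourceID ENUM('source_a', 'source_b', 'source_c') NOT NULL;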

Michael