
I would appreciate it if someone could explain how it is possible that MySQL is not churning with a large table on a default config.

note: I don't need advice on how to increase the memory, improve the performance, migrate, etc. I want to understand why it is working and performing well.

I have the following table:

CREATE TABLE `daily_reads` (
  `a` varchar(32) NOT NULL DEFAULT '',
  `b` varchar(50) NOT NULL DEFAULT '',
  `c` varchar(20) NOT NULL DEFAULT '',
  `d` varchar(20) NOT NULL DEFAULT '',
  `e` varchar(20) NOT NULL DEFAULT '',
  `f` varchar(10) NOT NULL DEFAULT 'Wh',
  `g` datetime NOT NULL,
  `PERIOD_START` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `i` decimal(16,3) NOT NULL,
  `j` decimal(16,3) NOT NULL DEFAULT '0.000',
  `k` decimal(16,2) NOT NULL DEFAULT '0.00',
  `l` varchar(1) NOT NULL DEFAULT 'N',
  `m` varchar(1) NOT NULL DEFAULT 'N',
  PRIMARY KEY (`a`,`b`,`c`,`PERIOD_START`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

It is running on a VM with 1 CPU Core, 6GB RAM, CentOS 7 (have very limited access to that VM).

It is running on a default MySQL config with 128MB buffer pool (SELECT @@innodb_buffer_pool_size/1024/1024)

DB size is ~96GB, ~560M rows in the daily_reads table, ~710M rows including the other tables.

select database_name, table_name, index_name, stat_value*@@innodb_page_size
from mysql.innodb_index_stats where stat_name='size';

PRIMARY: 83,213,500,416 (no other indexes)

I get like ~500K reads/month and writes are done only as part of an ETL process directly from Informatica to the DB (~ 75M writes/month).

The read queries are called only via stored procedure:

CALL sp_get_meter_data('678912345678', '1234567765432', '2017-01-13 00:00:00', '2017-05-20 00:00:00');

// stripped out the unimportant bits:
...
SET daily_from_date = DATE_FORMAT(FROM_DATE_TIME, '%Y-%m-%d 00:00:00');
SET daily_to_date = DATE_FORMAT(TO_DATE_TIME, '%Y-%m-%d 23:59:59');
...
SELECT
    *
FROM
    daily_reads
WHERE
    A = FIRST_NUMBER
    AND
    B = SECOND_NUMBER
    AND
    daily_from_date <= PERIOD_START
    AND
    daily_to_date >= PERIOD_START
ORDER BY
    PERIOD_START ASC;

My understanding of InnoDB is quite limited, but I thought I needed to fit all indexes into memory to get fast queries. The read procedure takes only a few milliseconds. I thought it was not technically possible to query a 500M+ row table this fast on a default MySQL config...?

What am I missing?


Kuko Kukanovic
  • databases have from the beginning been quite well optimised for size >> RAM; this is the nature of databases. Many, many sophisticated algorithms and data structures are used and work well. The modern "database in RAM" thinking is a niche, not mainstream – Jacek Cz Aug 03 '17 at 13:44
  • example: heavily used B-trees, R-trees and so on. A typical 'find' or 'next' operation may take 1-5 disk operations and return a result, in 95% of operations. The remaining 5% are slower – Jacek Cz Aug 03 '17 at 13:45
  • See the https://stackoverflow.com/a/1592279/2935802 – Sudhakar Aug 03 '17 at 13:52
  • this does not explain why this is working, and I'm also not looking at how to improve the performance, as the queries are executed within 20-100ms. Also, the B-tree index is kept in memory (and I have only 128MB, not 80GB), so I still don't see how this is possible – Kuko Kukanovic Aug 03 '17 at 13:56
  • Why don't you have any index on 'PERIOD_START'? Why not add a new column PERIOD_START_DATE of type DATE and add an index on it? Your query would be more efficient. – Linefinc Aug 03 '17 at 14:12
  • good catch Linefinc, that was a typo on my end, fixed that... – Kuko Kukanovic Aug 03 '17 at 14:17
  • I think this database uses [Partition](https://dev.mysql.com/doc/refman/5.7/en/alter-table-partition-operations.html); the same kind of performance has been achieved by others [Check this answer](https://stackoverflow.com/questions/4775527/mysql-improve-select-speed?rq=1). Out of curiosity, can you check whether any partitioning is done in the DB? [how to check](https://dev.mysql.com/doc/refman/5.5/en/partitioning-info.html) – Jophy job Aug 29 '17 at 17:28
  • @Jophyjob nope, no partitioning – Kuko Kukanovic Oct 01 '17 at 20:23

4 Answers


Long answer: Your primary key is a composite of several columns starting with a and b.

Your WHERE clause says this.

 WHERE a = FIRST_NUMBER
   AND b = SECOND_NUMBER
   AND etc etc.

This WHERE clause exploits the index associated with your primary key very efficiently indeed. It random-accesses the index to precisely the first row it needs, and then scans rows sequentially from there. So it doesn't actually have to page in much of your index or your table to satisfy the query.
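To put rough numbers on that, here is a back-of-envelope sketch in Python. The index size and page size come from the question and InnoDB's defaults, but the fanout of ~100 is an assumption (the real figure depends on key size):

```python
import math

# Figures from the question: PRIMARY index size, and InnoDB's default
# 16 KB page size.
index_bytes = 83_213_500_416
page_size = 16 * 1024

# Total pages in the B-tree (dominated by leaf pages).
pages = index_bytes // page_size   # about 5.1 million pages

# Assumed fanout: the composite key here is on the order of 100+ bytes,
# so roughly 100 child pointers per 16 KB internal page. This is a
# guess for illustration, not a measured InnoDB figure.
fanout = 100

# B-tree depth = number of pages a point lookup must touch.
depth = math.ceil(math.log(pages, fanout))
print(pages, depth)   # ~5.1 million pages, depth 4
```

So under these assumptions a point lookup touches about 4 pages (~64 KB), no matter that the whole index is 83 GB; the range scan afterwards only reads the leaf pages that actually hold matching rows.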

Short answer: When queries exploit indexes, MySQL is fast and cheap.

If you wanted an index that was perfect for this query, it would be a composite index on (a, b, PERIOD_START). This would use equality matching to hit the first matching row in the index, then range-scan the index for your chosen date range. But the performance you have now is pretty good.

You asked whether the index must fit entirely in memory. No. The entire purpose of DBMS software is to handle volumes of data that can't possibly fit in memory at once. Good DBMS implementations do a good job of maintaining memory caches and refreshing those caches from mass storage when needed. The InnoDB buffer pool is one such cache. Keep in mind that any insertions or updates to a table require both the table data and the index data to be written to mass storage eventually.

O. Jones
  • Thanks, the longer the better ;) Doesn't the index have to fit in memory? Because now I have to do a file scan on each request, right? – Kuko Kukanovic Aug 03 '17 at 14:54

Performance can be improved with some indexes.

In your specific case, you are filtering on 3 columns: A, B, and PERIOD_START. To speed up the query you can use an index on these columns.

Adding an index on PERIOD_START alone can be inefficient because this type also stores the time, so you have many different values within the same day.

You can add a new column that stores only the date part of PERIOD_START in the correct type (DATE), something like PERIOD_START_DATE, and add an index on this column.

This makes the indexing more effective and can improve performance, because you are doing a lookup (key -> values).

If you do not want to change your client code, you can use a "generated stored column"; see the MySQL manual.
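As a quick illustration of why indexing just the date part helps, here is a sketch in SQLite (chosen for portability; MySQL's generated-column syntax differs, and the table is heavily simplified):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in for the daily_reads table.
con.execute("CREATE TABLE daily_reads (a TEXT, b TEXT, period_start TEXT, i REAL)")
# Index on just the date part of the datetime string.
# SQLite supports indexes on expressions; in MySQL you would instead
# index a stored generated DATE column.
con.execute("CREATE INDEX idx_read_date ON daily_reads (substr(period_start, 1, 10))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM daily_reads "
    "WHERE substr(period_start, 1, 10) = '2017-01-13'").fetchall()
print(plan[0][3])   # SEARCH ... USING INDEX idx_read_date ...
```

One equality probe into the date index then covers a whole day's rows, instead of a range scan over every distinct datetime.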

Best regards

Linefinc
  • that was not the question, please read the note at the bottom – Kuko Kukanovic Aug 03 '17 at 17:35
  • The DB has a query cache, and it accelerates the queries that are requested most frequently. I have seen a 1 s query be fetched in less than 0.05 s. The cache works well, but if you have a lot of cache misses, the DB must perform the full query. From my understanding, you have several issues: a big table with a lot of scan time, and difficulty using the query cache. The answer may be no with a standard setup, but nothing stops you from building your own cache with a small table. Obviously, if you have physical access to the machine, you can probably apply some tricks – Linefinc Aug 03 '17 at 18:08
  • all the queries are executed in 17-20ms, so there is no issue with the performance. As I tried to explain in the question, I don't understand why this is working if there is no cache and the queries are random – Kuko Kukanovic Aug 03 '17 at 20:21

It's possible your index is getting used (probably not, given that the leading edge doesn't match the columns in your query), but even if it isn't, you'd only ever read through the table once, because the query doesn't have any joins, and subsequent runs would pick up the cached results.

Since you're using Informatica to load the data (it's a Swiss Army knife of data loading), it may be doing a lot more than you realise, e.g. assuming the data load is all inserts, it may drop and recreate indexes and run in bulk mode to load the data really quickly. It may even pre-run the query to prime your cache with the first post-load run.

Daniel Machet

Doesn't the index have to fit in memory?

No, the entire index does not have to fit in memory. Only the part of the index that needs to be examined during the query execution.

Since you have conditions on the left-most columns of your primary key (which is your clustered index), the query only examines rows that match the values you search for. The rest of the table is not examined at all.

You can try using EXPLAIN with your query and see an estimate of the number of rows examined. This is only a rough estimate calculated by the optimizer, but it should show that your query only needs to examine a small subset of the 550 million rows.
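For illustration, here is the same kind of check sketched against a simplified copy of the table in SQLite (MySQL's EXPLAIN output looks different, but the point is the same: an index SEARCH instead of a full table SCAN):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified version of the table. WITHOUT ROWID makes the composite
# primary key the clustered index, loosely mimicking InnoDB's PK.
con.execute("""CREATE TABLE daily_reads (
    a TEXT, b TEXT, c TEXT, period_start TEXT, i REAL,
    PRIMARY KEY (a, b, c, period_start)) WITHOUT ROWID""")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM daily_reads "
    "WHERE a = ? AND b = ? AND period_start BETWEEN ? AND ?",
    ("678912345678", "1234567765432", "2017-01-13", "2017-05-20")).fetchall()
print(plan[0][3])   # e.g. SEARCH daily_reads USING PRIMARY KEY (a=? AND b=?)
```

The plan says SEARCH with the primary key rather than SCAN, i.e. the engine descends the index to the matching rows instead of reading the whole table.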

The InnoDB buffer pool keeps copies of frequently-used pages in RAM. The more frequently a page is used, the more likely it is to stay in the buffer pool and not get kicked out. Over time, as you run queries, your buffer pool gradually stabilizes with the set of pages that is most worth keeping in RAM.

If your query workload were to really scan your entire table frequently, then the small buffer pool would churn a lot more. But it's likely that your queries request the same small subset of the table repeatedly. A phenomenon called the Pareto Principle applies in many real-world applications: the majority of the requests are satisfied by a small minority of data.
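A toy simulation of that effect (all the numbers here are made up for illustration): a small LRU "buffer pool" in front of a much larger set of pages, with 90% of requests going to 1% of the pages.

```python
import random
from collections import OrderedDict

random.seed(42)
total_pages = 100_000   # pages in the "table"
pool_size = 2_000       # pages the "buffer pool" can hold (2% of the table)
hot_pages = 1_000       # the frequently-requested 1%

pool = OrderedDict()    # LRU cache: most recently used at the end
hits = 0
requests = 50_000
for _ in range(requests):
    # 90% of requests go to the small hot set, 10% anywhere.
    if random.random() < 0.9:
        page = random.randrange(hot_pages)
    else:
        page = random.randrange(total_pages)
    if page in pool:
        hits += 1
        pool.move_to_end(page)
    else:
        pool[page] = True
        if len(pool) > pool_size:
            pool.popitem(last=False)   # evict the least recently used page

print(f"hit rate: {hits / requests:.0%}")   # well above 80%, even though the
                                            # pool holds only 2% of all pages
```

The hot set fits in the pool and stays there, so most requests never touch disk; that is the same mechanism that lets a 128MB buffer pool serve a 96GB database.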

This principle tends to fail when we run complex analytical queries, because those queries are more likely to scan the entire table.

Bill Karwin