
I am pulling my hair out trying to figure out what I am doing wrong. The table is very simple:

CREATE TABLE `icd_index` (
  `icd` char(5) NOT NULL,
  `core_id` int(11) NOT NULL,
  `dx_order` tinyint(4) NOT NULL,
  PRIMARY KEY (`icd`,`dx_order`,`core_id`),
  KEY `core` (`core_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

As you can see, I've created a covering index that spans all three columns of the table, plus an additional index on core_id for potential joins. This is a one-to-many link table, with each core_id mapped to one or more icd codes. The table contains 65 million rows.

So, here's the issue. Let's say I want to know how many people have an icd code of '25000'. [That's diabetes, in case you were wondering]. I write a query that looks like this:

SELECT COUNT(core_id) FROM icd_index WHERE icd='25000'

This takes over 60 seconds to execute. I had thought that since the icd column was first in the covering index, counting it would be fast.

More confusing, once I've run the query once, it now runs very quickly. I assumed that was because the query was cached, but even if I RESET QUERY CACHE, the query now runs in fractions of a second. But, if I wait long enough, it seems to slow down again -- and I can't figure out why.
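
To rule out the query cache, I've been flushing it and also bypassing it per statement, roughly like this:

RESET QUERY CACHE;  -- drop any cached result sets

SELECT SQL_NO_CACHE COUNT(core_id)
FROM icd_index
WHERE icd='25000';  -- SQL_NO_CACHE skips the query cache for this one statement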

Am I missing something obvious? Do I need an index on icd alone? Is this the best performance I can get with 65M rows? Why is the query still fast after I reset the cache? Are the results being stored in the index?

EDIT: I'm running MySQL 5.6 (in case that matters).

Here's the EXPLAIN of the query:

id  select_type  table      type  possible_keys  key      key_len  ref    rows    Extra
1   SIMPLE       icd_index  ref   PRIMARY        PRIMARY  15       const  910104  Using where; Using index
  • Can you try `SELECT COUNT(*) FROM icd_index where icd = '25000'`? – N.B. Mar 08 '15 at 01:49
  • Same result. The first time I execute it, it's 70 seconds. If I re-execute it, it's instantaneous (presumably from the cache). If I reset the cache and run it, it's 0.7 seconds. I'm using MySQL 5.6 in case that matters. – hpf Mar 08 '15 at 01:59
  • Have you tried a different storage engine? MyISAM is pretty dated, I don't know if it can even utilize all the hardware goodies available these days. – N.B. Mar 08 '15 at 02:01
  • I initially started with InnoDB. But counting the whole table in InnoDB was painfully slow, so I switched to MyISAM. I don't need transactions, as this DB is read-only -- the data is fixed and never changes. But if switching to InnoDB would help, I'd consider it (although it would be hours of work to do so...). Plus, I've added the EXPLAIN to the question, in case that's helpful. – hpf Mar 08 '15 at 02:04
  • `InnoDB` has to be configured to be quick. You can choose how much RAM is going to be dedicated to InnoDB, and it uses it to store the data it works with there. 65m rows that are accessible from RAM rather than disk account for really, really fast operations. The magic variable is called `innodb_buffer_pool_size`. Alternatively, there's also `TokuDB`, another magnificent storage engine. Also, are you using a mechanical hard drive or an SSD? – N.B. Mar 08 '15 at 02:05
  • I don't know what's on the backend. I have a virtual server running windows 2012. But I have no idea what the hardware is -- so probably a mechanical drive. I wonder how much RAM the table would take? On disk, it's 3.4 GB – hpf Mar 08 '15 at 02:19
  • I only have 2GB, and that's probably the problem. I bet the table is moved to RAM, and then querying is really fast. But then something else pushes the table out, and it's slow again. Does that make sense? If so, sounds like I need to beg for more RAM – hpf Mar 08 '15 at 02:27
  • The query result is definitely getting cached. However, in the world of databases - we want to have quick access to data. That's why people use computers with lots of RAM and fast SSDs. Sadly, what's happening at your end is that the HDD is being accessed in order to build the data set, and the disk is rather slow. The way I'd go about it is getting proper hardware that can get the data back quickly (and most DBAs would do the same). – N.B. Mar 08 '15 at 02:37
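
(For reference, the cache sizes discussed in this thread can be checked with standard MySQL commands; note that on 5.6, innodb_buffer_pool_size can only be changed in the config file followed by a server restart:)

SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM index cache
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB data + index cache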

3 Answers


Here's what is going on.

The query

SELECT COUNT(core_id) FROM icd_index WHERE icd='25000'

will use the index, which is a BTree separate from the data. But it scans it in this way:

  1. Find the first entry having icd='25000'. This is almost instantaneous.
  2. Scan forward until it finds a change in icd. This scans the index only, never touching the data. According to the EXPLAIN, there will be about 910,104 index entries to scan over.

Now let's look at the BTree for that index. Based on the fields in the index, each index entry will be exactly 22 bytes, plus some overhead (estimate 40%). A MyISAM index block is 1KB (cf InnoDB's 16KB), so I would estimate 33 entries per block. 910,104/33 says about 27K blocks need to be read to do the COUNT. (Note that COUNT(core_id) needs to check core_id for being null; COUNT(*) does not. This is a minor difference.) Reading 27K blocks on a plain hard drive takes about 270 seconds. You were lucky to get it done in 60 seconds.

The second run found all those blocks in the key_buffer (assuming key_buffer_size is at least 27MB), so it did not have to wait for the disk. Hence it was much faster. (This ignores the Query Cache, which you had the wisdom to flush, or to bypass with SQL_NO_CACHE.)
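
To check this on your server, these standard commands show the index size on disk and whether a re-run is being served from the key_buffer (the ~27MB figure is just my estimate above):

SHOW TABLE STATUS LIKE 'icd_index';     -- Index_length = bytes of indexes on disk
SHOW VARIABLES LIKE 'key_buffer_size';  -- should comfortably exceed the blocks scanned
SHOW STATUS LIKE 'Key_read%';           -- if Key_reads barely grows on a re-run,
                                        -- the scan came from the key_buffer, not disk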

5.6 happens to be irrelevant (but thanks for mentioning it), since this process has not changed since 4.0 or before (except that utf8 did not exist; more on that below).

Switching to InnoDB would help in a couple of ways. The PRIMARY KEY would be 'clustered' with the data, not stored as a separate BTree. Hence, once the data or the PK is cached, the other is immediately available. The number of blocks would be more like 5K, but they would be 16KB blocks. These might be loadable faster if the cache is cold.
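
If you decide to try it, the engine conversion itself is a single statement; it rebuilds all 65M rows, so expect it to run a long time and to need extra disk space while it works:

ALTER TABLE icd_index ENGINE=InnoDB;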

You ask "Do I need an index on icd alone?" -- Well, that would shrink the MyISAM BTree to about 21 bytes per row, so the BTree would be about 21/27ths the size -- not much improvement (at least for the cold-cache situation).

Another thought: if icd is always numeric, use MEDIUMINT UNSIGNED, and tack on ZEROFILL if it can have leading zeros.

Oops, I failed to notice the CHARACTER SET. (I have fixed the numbers above, but let me elaborate.)

  • CHAR(5) allows for 5 characters.
  • ascii takes 1 byte per character.
  • utf8 takes up to 3 bytes per character.
  • So, CHAR(5) CHARACTER SET utf8 always takes 15 bytes.

Changing the column to CHAR(5) CHARACTER SET ascii would shrink it to 5 bytes.

Changing it to MEDIUMINT UNSIGNED ZEROFILL would shrink it to 3 bytes.

Shrinking the data would speed up I/O by a roughly proportional amount (after allowing another 6 bytes for the other two fields).
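
For concreteness, here are both changes as ALTER statements (the MEDIUMINT form assumes every icd really is numeric; skip it if codes can contain letters):

-- 15 bytes down to 5: same values, single-byte character set
ALTER TABLE icd_index MODIFY icd CHAR(5) CHARACTER SET ascii NOT NULL;

-- 15 bytes down to 3: numeric codes only; ZEROFILL preserves leading zeros
-- ALTER TABLE icd_index MODIFY icd MEDIUMINT UNSIGNED ZEROFILL NOT NULL;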

Rick James
  • 65M rows --> 4GB of innodb_buffer_pool_size; do you have at least 6GB of RAM? – Rick James Mar 09 '15 at 04:43
  • Thanks for the great answer. I am getting a RAM upgrade shortly. The ICD needs to be CHAR since some codes start with a letter. But I could switch to ascii without a problem. I could also switch to InnoDB -- in fact I could recreate the table both ways and see which is faster. – hpf Mar 10 '15 at 12:33
  • Tips on [converting to InnoDB](http://mysql.rjweb.org/doc.php/myisam2innodb). If you are repeatedly repopulating the table, I can show you a way to do so with zero downtime. If your "working set" is less than the entire table, a RAM upgrade may not be necessary. However, for InnoDB, the buffer_pool should be set to about 70% of RAM (less than that if you have less than 4GB). – Rick James Mar 10 '15 at 17:08
  • Caution: the `ZEROFILL` attribute is deprecated. (There are workarounds if you need it.) – Rick James Feb 11 '22 at 18:31

Thanks to all above for your help. Given the above advice, I totally rebuilt the database like so:

  1. I convinced the server admin to increase my RAM to 6G.
  2. I switched all tables to InnoDB with an ASCII character set.
  3. When I moved the data from MyISAM to InnoDB, I sorted all of the data in the order of the covering index before inserting it into the new table, so the new table is completely sorted correctly (see the sketch after this list). No idea if this really helps, but it seemed like it couldn't hurt.
  4. I tinkered with the DB settings, specifically the InnoDB Buffer Pool size and increased it to 256M.
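
Roughly what steps 2 and 3 looked like (a sketch; icd_index_new is just a scratch name):

CREATE TABLE icd_index_new (
  icd char(5) NOT NULL,
  core_id int(11) NOT NULL,
  dx_order tinyint(4) NOT NULL,
  PRIMARY KEY (icd, dx_order, core_id),
  KEY core (core_id)
) ENGINE=InnoDB DEFAULT CHARSET=ascii;

-- Inserting in PRIMARY KEY order so the clustered index builds sequentially
INSERT INTO icd_index_new
SELECT icd, core_id, dx_order FROM icd_index
ORDER BY icd, dx_order, core_id;

RENAME TABLE icd_index TO icd_index_old, icd_index_new TO icd_index;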

Holy mother of God, it's really fast now. The simple count query above now runs in less than 2 seconds. I'm not sure which of the steps above was the most effective (though the query was fast even before the buffer pool size increase).

hpf

I had the same thing happening with one of my queries. A MyISAM table was using a filesort to do a simple SELECT statement.

I ended up switching to InnoDB and the problem disappeared. I don't know why.

Phil LaNasa