0

I have a huge (and growing) MyISAM table (700 million rows = 140 GB).

CREATE TABLE `keypairs` (
  `ID` char(60) NOT NULL,
  `pair` char(60) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=MyISAM

The table option was changed to ROW_FORMAT=FIXED, since both columns are always fixed-length at the maximum (60). And yes, sadly, ID is a string and not an INT.

SELECT queries are reasonably fast.

The databases and the MySQL engine all run on 127.0.0.1/localhost (nothing remote).

Sadly, INSERT is extremely slow. Never mind trying to LOAD DATA millions of new rows... that takes days.

There won't be any concurrent reads on it. All SELECTs are done one by one, only by my local server (it is not for clients' use).

(For info: file sizes are .MYD = 88 GB, .MYI = 53 GB, .TMM = 400 MB.)

  • How could I speed up inserts into that table?
  • Would it help to PARTITION that huge table? (How, then?)
  • I heard MyISAM keeps a "structure cache" of the .frm files, and that a line in the config file helps MySQL keep all the .frm files in memory (in case of a partitioned table). Would that help too? Actually, my .frm file is only 9 KB for 700 million rows.
  • What about a string-shortening/compression function for the ID string (same idea as rainbow tables)? Even if it lowers the maximum number of unique IDs, I will never reach the max of 60 chars anyway, so maybe it's an idea? But before creating a new unique ID I would of course have to check that the shortened string doesn't already exist in the DB.
  • Same idea as shortening the ID strings: what about using md5() on the ID? Does a shorter string mean faster in that case?
user3916429
  • 562
  • 6
  • 25
  • Ignoring the fact that the column `ID` is the `PK` of the table, is the index on `ID` required by the application? I'm thinking of dropping it and adding a new unsigned big int auto-incremented column as `PK`. – axiac Jan 30 '17 at 20:35
  • Sadly yes, as the column name says, it's the unique identifier that allows me to get the `pair` data... I can't change it to an INT. At the very least, the ID field as UNIQUE is required, and I need it to match the data in SELECT queries :/ – user3916429 Jan 30 '17 at 21:04
  • The `.frm` file contains the [table definition](https://dev.mysql.com/doc/internals/en/frm-file-format.html), not data. That's why its size is only 9kb and it doesn't grow. Never heard of `.tmm` files but a quick search on the Internet leads to an [error message produced by MySQL](https://bugs.mysql.com/bug.php?id=54828) which says that they are temporary files and can be safely removed. – axiac Jan 30 '17 at 21:05
  • Well, you need a unique field to identify the rows and return the values of the `pair` field, but is `ID` that field, or could a new int field do the job? Does the code receive the values of `ID` from an external source when it issues a `SELECT` against the table? E.g., is it a `PK` in another table? Or are the values of `ID` returned by an API to a different system, which later sends the IDs in other requests to get the rest of the data (column `pair`)? – axiac Jan 30 '17 at 21:10
  • I have no other way to get the `pair` data than identifying the row by the ID STRING. If I changed it to an INT, I would need a second table (`ID STRING`, `ID INT`) to link them, so same problem :/ That's why I thought of string-shortener functions that take some chars at fixed positions to recreate a shorter string... Or maybe I should hash the string to get a shorter one? Would a shorter ID string mean faster? – user3916429 Jan 30 '17 at 21:27
  • What do the strings look like? Part of the solution I will propose involves shrinking the disk footprint. – Rick James Jan 31 '17 at 05:36
  • With one exception, the size of the column(s) in an index directly impacts the size of the index's BTree, hence the cacheability, hence I/O, hence speed. The exception: InnoDB's `PRIMARY KEY`, which is clustered with the data. – Rick James Jan 31 '17 at 06:18

3 Answers

3
  • Sort the incoming data before doing the LOAD. This will improve the cacheability of the PRIMARY KEY(id).
  • PARTITIONing is unlikely to help, unless there is some useful pattern to ID.
  • PARTITIONing will not help for single-row insert nor for single-row fetch by ID.
  • If the strings are not a constant width of 60, you are wasting space and speed by saying CHAR instead of VARCHAR. Change that.
  • MyISAM's FIXED is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
  • Smaller means more cacheable means less I/O means faster.
  • The .frm is an encoding of the CREATE TABLE; it is not relevant for this discussion.
  • A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
  • MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to BINARY(16).
  • The PRIMARY KEY is a BTree. If ID is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
  • Turning the PRIMARY KEY into a secondary key (after adding an AUTO_INCREMENT) will not speed things up -- it still has to update the BTree with ID in it!
  • How much RAM do you have? For your situation, and for this LOAD, set MyISAM's key_buffer_size to about 70% of available RAM, but not bigger than the .MYI file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD is only being appended to (assuming you have never deleted any rows).
  • We do need to see your SELECTs to make sure these changes are not destroying performance somewhere else.
  • Make sure you are using CHARACTER SET latin1 or ascii; utf8 would waste a lot more space with CHAR.
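As a rough sketch of the key-buffer suggestion above, the variable can be set at runtime (the 5 GB value is illustrative, assuming ~8 GB of RAM; scale to ~70% of yours, capped at the .MYI size):

```sql
-- Illustrative only: pick ~70% of available RAM, not more than the .MYI size
SET GLOBAL key_buffer_size = 5 * 1024 * 1024 * 1024;
SHOW VARIABLES LIKE 'key_buffer_size';
```

Setting it in my.cnf instead makes the value survive a server restart.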

Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably slow down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size should be lowered to 10M and innodb_buffer_pool_size should be set to 70% of available RAM.

(My bullet items apply to InnoDB except where MyISAM is specified.)

In using InnoDB, it would be good to try to insert 1000 rows per transaction. Less than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
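The 1000-rows-per-transaction advice might look like this (the literal values are placeholders, not real data):

```sql
-- Illustrative batching for InnoDB: one transaction per chunk of ~1000 rows
SET autocommit = 0;
BEGIN;
INSERT INTO keypairs (ID, pair) VALUES
    ('<id 1>',    '<pair 1>'),
    -- ... roughly 1000 value tuples per statement ...
    ('<id 1000>', '<pair 1000>');
COMMIT;
```

Multi-row INSERT statements also cut per-statement parsing overhead, independently of the transaction batching.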

Hex ID

Since ID is always 60 hex digits, declare it to be BINARY(30) and pack them via UNHEX(...) and fetch via HEX(ID). Test via WHERE ID = UNHEX(...). That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
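A sketch of what reads and writes look like with a BINARY(30) column ('ab12...' stands in for a full 60-hex-digit string):

```sql
-- Illustrative only: 'ab12...' is a placeholder for 60 hex digits
INSERT INTO keypairs (ID, pair)
    VALUES (UNHEX('ab12...'), 'the pair value');

SELECT HEX(ID) AS ID, pair
    FROM keypairs
    WHERE ID = UNHEX('ab12...');
```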

To do just the conversion to BINARY(30):

CREATE TABLE new (
    ID BINARY(30) NOT NULL,
    `pair` char(60) NOT NULL
    -- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
    SELECT UNHEX(ID),
           pair
        FROM keypairs;
ALTER TABLE new ADD
    PRIMARY KEY (`ID`);   -- For InnoDB, I would do differently
RENAME TABLE keypairs TO old,
             new TO keypairs;
DROP TABLE old;

Tiny RAM

With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M and innodb_buffer_pool_size=0. For InnoDB-only: key_buffer_size=10M and innodb_buffer_pool_size=500M. Since ID is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have them.
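The tiny-RAM settings above, sketched out (values straight from the paragraph; innodb_buffer_pool_size is not dynamic before MySQL 5.7, so the InnoDB variant goes in my.cnf):

```sql
-- MyISAM-only box with ~2 GB RAM:
SET GLOBAL key_buffer_size = 300 * 1024 * 1024;   -- 300M

-- InnoDB-only alternative (set in my.cnf and restart):
--   key_buffer_size         = 10M
--   innodb_buffer_pool_size = 500M
```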

The other thing to do to speed up the INSERTs is to sort by ID before starting the LOAD. But that gets tricky with the UNHEX. Here's what I recommend.

  1. Create a MyISAM table, tmp, with ID BINARY(30) and pair, but no indexes. (Don't worry about key_buffer_size; it won't be used.)
  2. LOAD the data into tmp.
  3. ALTER TABLE tmp ORDER BY ID; This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster than "repair by key buffer" for this case.
  4. INSERT INTO keypairs SELECT * FROM tmp; This will maximize the caching by feeding rows to keypairs in ID order.
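Steps 1-4 above could be sketched as follows (the file path is hypothetical; the field/line terminators assume the `"stringID";"stringPAIR"\r\n` CSV layout from the comments):

```sql
-- Step 1: staging table, no indexes
CREATE TABLE tmp (
    ID   BINARY(30) NOT NULL,
    pair CHAR(60)   NOT NULL
) ENGINE=MyISAM;

-- Step 2: load (path and CSV layout are assumptions)
LOAD DATA INFILE '/path/to/data.csv' INTO TABLE tmp
    FIELDS TERMINATED BY ';' ENCLOSED BY '"'
    LINES TERMINATED BY '\r\n'
    (@myid, @mypair)
    SET ID = UNHEX(@myid), pair = @mypair;

-- Step 3: sort the table; still no index
ALTER TABLE tmp ORDER BY ID;

-- Step 4: feed rows to keypairs in ID order, then clean up
INSERT INTO keypairs SELECT * FROM tmp;
DROP TABLE tmp;
```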

Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs is. I expect step 3 or 4 to take the longest, but I don't know which.

Rick James
  • 135,179
  • 13
  • 127
  • 222
  • My queries are simply `SELECT pair FROM keypairs WHERE ID=string LIMIT 1;`, never anything else. My ID strings are already hash-like, containing only [0-9a-f] chars, BUT are ALWAYS 60 characters long – user3916429 Jan 31 '17 at 06:15
  • Added to answer to cover that case. – Rick James Jan 31 '17 at 06:26
  • Is there a command I can use straight from the mysql command line to export my current table into a newly created one, converting the CHAR(60) field to BINARY(30) at the same time? – user3916429 Jan 31 '17 at 06:26
  • added code for doing just ID. If pair needs it too, do it at the same time. – Rick James Jan 31 '17 at 06:34
  • So, for news: I don't know yet about the speed. But at least for SIZE... from 140 GB to 60 GB total. Nice optimization, at least on that! – user3916429 Jan 31 '17 at 17:06
  • For future readers, if it can help: here is how I now handle `LOAD DATA INFILE` using `UNHEX()`, since my .csv file contains strings (formatted as `"stringID";"stringPAIR"\r\n`). `LOAD DATA INFILE 'G:\\sorted.csv' IGNORE INTO TABLE keypairs FIELDS TERMINATED BY ';' ENCLOSED BY '"' LINES TERMINATED BY '\r\n' (@myid, @mypair) SET id = UNHEX(@myid), pair = UNHEX(@mypair);` – user3916429 Jan 31 '17 at 18:13
  • And inserts are still very slow. 10 million rows using LOAD DATA took 3h. (It's a normal computer, 2 GB RAM though.) – user3916429 Jan 31 '17 at 23:44
  • OK, I spelled out what to do about the tiny RAM. – Rick James Feb 01 '17 at 00:56
  • Thank you for all your time. I am already sorting the CSV data file before LOAD DATA (is that different from sorting via a tmp table?). I'm using the powerful `sort` from CoreUtils. I will still try what you suggest, for comparison. There is nothing more you can do for me right now, I guess :) You covered most of the possibilities, I think. Thanks again =) – user3916429 Feb 01 '17 at 06:03
  • I'm worried that the sort order _before_ the `UNHEX` will be different than _after_. I think, but have not proven, that they will be the same, so your point is valid. – Rick James Feb 01 '17 at 06:23
0

Optimizing a table requires that you optimize for specific queries. You can't determine the best optimization strategy unless you have specific queries in mind. Any optimization improves one type of query at the expense of other types of queries.

For example, if your query is SELECT SUM(pair) FROM keypairs (a query that would have to scan the whole table anyway), partitioning won't help, and just adds overhead.

If we assume your typical query is inserting or selecting one keypair at a time by its primary key, then yes, partitioning can help a lot. It all depends on whether the optimizer can tell that your query will find its data in a narrow subset of partitions (ideally one partition).
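If partitioning were attempted for that point-lookup workload, hash-style partitioning on the primary key would be the natural fit, so each lookup touches a single partition (the partition count here is arbitrary):

```sql
-- Hypothetical: hash-partition on the PK; 64 is an arbitrary partition count
ALTER TABLE keypairs
    PARTITION BY KEY (ID)
    PARTITIONS 64;
```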

Also make sure to tune MyISAM. There aren't many tuning options:

  • Allocate key_buffer_size as high as you can spare to cache your indexes. Though I haven't ever tried anything higher than about 10GB, and I can't guarantee that MyISAM key buffers are stable at 53GB (the size of your MYI file).
  • Pre-load the key buffers: https://dev.mysql.com/doc/refman/5.7/en/cache-index.html
  • Size read_buffer_size and read_rnd_buffer_size appropriately given the queries you run. I can't give a specific value here, you should test different values with your queries.
  • Size bulk_insert_buffer_size to something large if you want to speed up LOAD DATA INFILE. It's 8MB by default, I'd try at least 256MB. I haven't experimented with that setting, so I can't speak from experience.
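The tuning options above could be sketched like this (all sizes are illustrative; test against your own queries and RAM):

```sql
-- Illustrative values only; adjust to your hardware and workload
SET GLOBAL key_buffer_size         = 4 * 1024 * 1024 * 1024;  -- index cache
SET GLOBAL bulk_insert_buffer_size = 256 * 1024 * 1024;       -- LOAD DATA

-- Pre-load the table's index into the key buffer:
LOAD INDEX INTO CACHE keypairs;

-- Per-session read buffers:
SET SESSION read_buffer_size     = 2 * 1024 * 1024;
SET SESSION read_rnd_buffer_size = 8 * 1024 * 1024;
```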

I try not to use MyISAM at all. MySQL is definitely trying to deprecate its use.

...is there a mysql command to ALTER TABLE add INT ID increment column automatically?

Yes, see my answer to https://stackoverflow.com/a/251630/20860
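In outline, such an ALTER might look like the sketch below (the column name `row_id` is hypothetical, chosen to avoid colliding with the existing `ID` column, since MySQL column names are case-insensitive; an AUTO_INCREMENT column must be part of a key, hence the UNIQUE KEY in the same statement):

```sql
-- Sketch: add a surrogate auto-increment column; MySQL numbers
-- the existing rows automatically
ALTER TABLE keypairs
    ADD COLUMN row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD UNIQUE KEY (row_id);
```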

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
-1

First, your primary key is not sequential. Which means, roughly: at every insert, the index has to be rebalanced.

No wonder it slows to a crawl with a table of such a size. And with such an engine...

So, to the second point: what's the point of keeping that old MyISAM junk?

Like, for example, you don't mind losing a row or two (or -teen) in case of an accident? Etc., etc., even setting aside that the current MySQL maintainer (Oracle Corp) explicitly discourages the use of MyISAM.

So, here are possible solutions:

1) Switch to InnoDB;

2) If you can't surrender the char ID, then:

  • Add an auto-increment numeric key and make it the primary key: then the index will be clustered and the cost of inserts will drop significantly;

  • Turn your current key into a secondary index;

3) If you can surrender the char ID, the solution is obvious.
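Option 2 could be sketched in one statement like this (the column name `row_id` is hypothetical, chosen so it does not collide with the existing `ID` column, since MySQL column names are case-insensitive):

```sql
-- Sketch: convert to InnoDB, add a surrogate auto-increment PK,
-- and demote the old char key to a secondary (unique) index
ALTER TABLE keypairs
    ENGINE=InnoDB,
    DROP PRIMARY KEY,
    ADD COLUMN row_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD PRIMARY KEY (row_id),
    ADD UNIQUE KEY (ID);
```

On a 700-million-row table, an ALTER like this rebuilds the whole table, so it would run for a long time.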

Yuri G
  • 1,206
  • 1
  • 9
  • 13
  • Thanks. My DB was InnoDB at first, but looking at some comparisons here, InnoDB is slower for LOAD DATA than MyISAM, so I switched :/ Wouldn't having 2 keys (INT PRIMARY, CHAR INDEX?) slow down inserts even more? Is partitioning not a good idea either? – user3916429 Jan 30 '17 at 23:13
  • Oh, and too late to edit, but if that's the solution, is there a MySQL command to ALTER TABLE to add an INT ID auto-increment column automatically? – user3916429 Jan 30 '17 at 23:22
  • Hold on, so what are most of your data additions: inserts or bulk loads? Prioritize what you're aiming for. For bulk loads, consider these tips from the MySQL folks: https://dev.mysql.com/doc/refman/5.7/en/optimizing-innodb-bulk-data-loading.html Then don't worry about secondary indexes - worry about the primary. Its consistency is essential for the (atomic) insert operation to succeed (it's a "particle" in the atom). A secondary index's is not, and it will be rebuilt in parallel, succeeding later (see: https://dev.mysql.com/doc/refman/5.5/en/innodb-performance-change_buffering.html) – Yuri G Jan 31 '17 at 01:57
  • As for the ALTER TABLE you mentioned - yes, such an operation exists, but it won't help you: even though a "conversion declaration" for the table exists, with an auto-increment index you'd better rely on its values being generated on inserts. So you'd better create a new InnoDB table and then roll the data over to it with INSERT INTO ... SELECT. That is safer, after all. – Yuri G Jan 31 '17 at 02:10
  • Downvoting because of multiple wrong statements. See my Answer for more info. – Rick James Jan 31 '17 at 06:08