- Sort the incoming data before doing the
LOAD
. This will improve the cacheability of the PRIMARY KEY(id)
.
PARTITIONing
is unlikely to help, unless there is some useful pattern to ID
.
PARTITIONing
will not help for single-row insert nor for single-row fetch by ID
.
- If the strings are not a constant width of 60, you are wasting space and speed by saying
CHAR
instead of VARCHAR
. Change that.
- MyISAM's
FIXED
is useful only if there is a lot of 'churn' (deletes+inserts, and/or updates).
- Smaller means more cacheable means less I/O means faster.
- The
.frm
is an encoding of the CREATE TABLE
; it is not relevant for this discussion.
- A simple compress/zip/whatever will almost always compress text strings longer than 10 characters. And they can be uncompressed, losslessly. What do your strings look like? 60-character English text will shrink to 20-25 bytes.
- MD5 is a "digest", not a "compression". You cannot recover the string from its MD5. Anyway, it would take 16 bytes after converting to
BINARY(16)
.
- The
PRIMARY KEY
is a BTree. If ID
is somewhat "random", then the 'next' ID (unless the input is sorted) is likely not to be cached. No, the BTree is not rebalanced all the time.
- Turning the
PRIMARY KEY
into a secondary key (after adding an AUTO_INCREMENT
) will not speed things up -- it still has to update the BTree with ID
in it!
- How much RAM do you have? For your situation, and for this LOAD, set MyISAM's
key_buffer_size
to about 70% of available RAM, but not bigger than the .MYI
file. I recommend a big key_buffer because that is where the random accesses are occurring; the .MYD
is only being appended to (assuming you have never deleted any rows).
- We do need to see your
SELECTs
to make sure these changes are not destroying performance somewhere else.
- Make sure you are using
CHARACTER SET
latin1 or ascii; utf8 would waste a lot more space with CHAR
.
Switching to InnoDB will double, maybe triple, the disk space for the table (data+index). Therefore, it will probably show down. But a mitigating factor is that the PK is "clustered" with the data, so you are not updating two things for each row inserted. Note that key_buffer_size
should be lowered to 10M and innodb_buffer_pool_size
should be set to 70% of available RAM.
(My bullet items apply to InnoDB except where MyISAM is specified.)
In using InnoDB, it would be good to try to insert 1000 rows per transaction. Less than that leads to more transaction overhead; more than that leads to overrunning the undo log, causing a different form of slowdown.
Hex ID
Since ID
is always 60 hex digits, declare it to be BINARY(30)
and pack them via UNHEX(...)
and fetch via HEX(ID)
. Test via WHERE ID = UNHEX(...)
. That will shrink the data about 25%, and MyISAM's PK by about 40%. (25% overall for InnoDB.)
To do just the conversion to BINARY(30)
:
CREATE TABLE new (
ID BINARY(30) NOT NULL,
`pair` char(60) NOT NULL
-- adding the PK later is faster for MyISAM
) ENGINE=MyISAM;
INSERT INTO new
SELECT UNHEX(ID),
pair
FROM keypairs;
ALTER TABLE keypairs ADD
PRIMARY KEY (`ID`); -- For InnoDB, I would do differently
RENAME TABLE keypairs TO old,
new TO keypairs;
DROP TABLE old;
Tiny RAM
With only 2GB of RAM, a MyISAM-only dataset should use something like key_buffer_size=300M
and innodb_buffer_pool_size=0
. For InnoDB-only: key_buffer_size=10M
and innodb_buffer_pool_size=500M
. Since ID
is probably some kind of digest, it will be very random. The small cache and the random key combine to mean that virtually every insert will involve a disk I/O. My first estimate would be more like 30 hours to insert 10M rows. What kind of drives do you have? SSDs would make a big difference if you don't already have such.
The other thing to do to speed up the INSERTs
is to sort by ID
before starting the LOAD
. But that gets tricky with the UNHEX
. Here's what I recommend.
- Create a MyISAM table,
tmp
, with ID BINARY(30)
and pair
, but no indexes. (Don't worry about key_buffer_size
; it won't be used.)
LOAD
the data into tmp
.
ALTER TABLE tmp ORDER BY ID;
This will sort the table. There is still no index. I think, without proof, that this will be a filesort, which is much faster that "repair by key buffer" for this case.
INSERT INTO keypairs SELECT * FROM tmp;
This will maximize the caching by feeding rows to keypairs
in ID
order.
Again, I have carefully spelled out things so that it works well regardless of which Engine keypairs
is. I expect step 3 or 4 to take the longest, but I don't know which.