I am trying to understand exactly what is and is not useful in a multiple-field index. I have read this existing question (and many more) plus other sites/resources (MySQL Performance Blog, Percona slideshares, etc.) but I'm not totally confident that what I've found on the subject is current and accurate. So please bear with me while I repeat some of what I think I know.
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Given the above, using an index to match my condition(s) works the same way as fetching my desired result, except that I have created a special-purpose table (the index) that gets me an intermediate result set very quickly. With that intermediate result set, I can retrieve my final result set much more efficiently than by performing a full table scan.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
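To make the covering-index point concrete, here is a minimal sketch of how I would check it. The table, column, and index names are made up for illustration, and the EXPLAIN results described in the comments are what I would expect from the documentation, not output I am quoting.

```sql
-- Hypothetical demo table; all names here are invented for illustration.
CREATE TABLE covering_demo (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  last_name  VARCHAR(30) NOT NULL,
  first_name VARCHAR(30) NOT NULL,
  notes      VARCHAR(200),
  INDEX idx_last_first (last_name, first_name)
) ENGINE=MyISAM;

-- A few rows, so the optimizer is not looking at an empty table.
INSERT INTO covering_demo (last_name, first_name, notes) VALUES
  ('Smith', 'Sue',  'a'),
  ('Brown', 'Sue',  'b'),
  ('Jones', 'Alex', 'c'),
  ('Smith', 'Bob',  'd');

-- Every referenced column is part of idx_last_first, so I would expect
-- "Using index" in the Extra column (no row fetch needed).
EXPLAIN SELECT last_name, first_name
FROM covering_demo
WHERE last_name = 'Smith';

-- notes is not in the index, so each match needs a lookup into the
-- data file; I would not expect "Using index" here.
EXPLAIN SELECT last_name, first_name, notes
FROM covering_demo
WHERE last_name = 'Smith';
```

On a table this small the optimizer may pick a different plan than it would at scale, so the comparison only means something on realistic data volumes.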
Assuming the above is at least generally accurate, here's my puzzle. I have a slowly changing dimension table defined with the following columns (more or less) and using MyISAM:
dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
person_ID INT UNSIGNED NOT NULL,
raw_name VARCHAR(92) NOT NULL,
first VARCHAR(30),
middle VARCHAR(50),
last VARCHAR(30),
suffix CHAR(3),
flag CHAR(1)
Each "owner" is a unique instance of a particular individual with a particular name, so if Sue Smith changes her name to Sue Brown, that results in two rows that are the same except for the last field and the surrogate key.
My understanding is that the only way to enforce this constraint internally is to do:
UNIQUE INDEX uq_owner_complete (person_ID, raw_name, first, middle, last, suffix, flag)
And that's basically going to duplicate the entire table (except for the surrogate key).
I also need to index a few other fields for quick joins and searches. While there will be some writes, and disk space is neither free nor infinite, read performance is absolutely the #1 priority here. These smaller indexes should serve very well to cover the conditions of the queries that will be run against the table, but in almost every case, the entire row needs to be selected.
With that in mind:
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
If this were an InnoDB table, the PK would already be part of every secondary index, but since it's MyISAM, the indexes hold pointers to the full rows instead. So if I'm understanding things correctly, there's no point (no pun intended) in adding the PK to any other index, unless doing so would allow the desired result set to be retrieved directly from the index, which is not likely here.
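For concreteness, here is a rough sketch of the options I am weighing. The table name `dim_owner` is a placeholder (I have not given the real one), the three extra index names are hypothetical, and `last` is just one example of a column I might search on.

```sql
-- Placeholder table name; the columns are the ones listed above.
-- `first` and `last` are backtick-quoted only to be safe with keywords.
CREATE TABLE dim_owner (
  dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  person_ID    INT UNSIGNED NOT NULL,
  raw_name     VARCHAR(92) NOT NULL,
  `first`      VARCHAR(30),
  middle       VARCHAR(50),
  `last`       VARCHAR(30),
  suffix       CHAR(3),
  flag         CHAR(1),
  UNIQUE INDEX uq_owner_complete
    (person_ID, raw_name, `first`, middle, `last`, suffix, flag)
) ENGINE=MyISAM;

-- Sample rows: the same person under two names, plus one other person.
INSERT INTO dim_owner (person_ID, raw_name, `first`, middle, `last`, suffix, flag) VALUES
  (1, 'Sue Smith', 'Sue', NULL, 'Smith', NULL, 'N'),
  (1, 'Sue Brown', 'Sue', NULL, 'Brown', NULL, 'Y'),
  (2, 'Bob Smith', 'Bob', NULL, 'Smith', NULL, 'Y');

-- Option 1: short prefix index (smallest, can never cover `last` itself).
ALTER TABLE dim_owner ADD INDEX idx_last_prefix (`last`(10));

-- Option 2: composite index over the columns used in conditions.
ALTER TABLE dim_owner ADD INDEX idx_last_first (`last`, `first`);

-- Option 3: search column plus the surrogate key, added explicitly
-- because MyISAM does not store the PK in secondary indexes.
ALTER TABLE dim_owner ADD INDEX idx_last_id (`last`, dim_owner_ID);

-- The typical query needs the whole row, so none of the three can cover it.
EXPLAIN SELECT * FROM dim_owner FORCE INDEX (idx_last_first)
WHERE `last` = 'Smith';

-- Only a query that wants nothing but the key would be covered by option 3;
-- I would expect "Using index" in Extra here.
EXPLAIN SELECT dim_owner_ID FROM dim_owner FORCE INDEX (idx_last_id)
WHERE `last` = 'Smith';
```

Comparing `key`, `key_len`, and `Extra` across these plans on realistic data is how I would judge whether anything between option 1 and a full-width index is worth its disk space.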
I understand if it seems like I'm trying too hard to optimize, and maybe I am, but the tasks I need to perform using this database take weeks at a time, so every little bit helps.