
I tested an update between two large tables (~5 million records each) which was taking 10 seconds or so per update. So, running EXPLAIN for my very first time, I tested the select:

SELECT
    T1.Z, T2.Z
FROM
    TableB T1
INNER JOIN TableL T2
    on T1.Name=T2.Name 
   and T1.C=T2.C
   and T1.S=T2.S
   and T1.Number>=T2.MinNumber
   and T1.Number<=T2.MaxNumber

Explain returned the following as possible keys:

  • Name
  • C
  • S
  • Number

and chose C as the key.

I was told that my best bet was to make a compound key, in the order of the select, so I did:

Alter Table TableB Add Index Compound (Name,C,S,Number)

And did an explain again, hoping it would choose my compound, but now, even though it shows the compound index as a possible key, it still chooses index C.

I read that I can force the index I want with:

SELECT
    T1.Z, T2.Z
FROM TableB T1 Force Index(Compound)
INNER JOIN TableL T2
    on T1.Name=T2.Name 
   and T1.C=T2.C
   and T1.S=T2.S
   and T1.Number>=T2.MinNumber
   and T1.Number<=T2.MaxNumber

yet I am not sure it makes any sense to override MySQL's selection, and given that the update is going to take almost two years if it doesn't help, it doesn't seem like a smart thing to test.

Is there some step I am missing? Do I need to remove the other keys so that it chooses my compound one, and if so, how will I know whether it will even make a difference (given that MySQL saw it and rejected it)?


Explain output for Table1 (note: I did not yet add the Compound index, as the table is huge and it might be wasted time until I figure this out. I previously added it on a highly truncated version of the table, but that won't help with this explain):

  • select_type: simple
  • type: ref
  • possible_keys: Number,C,S,Name
  • key: Name
  • key_len: 303
  • ref: func
  • rows: 4
  • Extra: using where

Explain for Table2

  • select_type: SIMPLE
  • type: ALL
  • possible_keys: MinNumber, MaxNumber
  • key:
  • key_len:
  • ref:
  • rows: 5,447,100
  • Extra:

Cardinality (only showing indexes relevant here as there are a few others):

  • Primary: 5139680

  • Name: 1284920

  • Number: 57749

  • C: 7002

  • S: 21
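(For anyone reproducing this: cardinality figures like the above are the optimizer's per-index estimates, and can be pulled directly; the table name below is the one from the query, and ANALYZE TABLE refreshes the estimates first.)

```sql
-- Refresh, then read, the optimizer's per-index cardinality estimates.
ANALYZE TABLE TableB;
SHOW INDEX FROM TableB;
```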

  • @Strawberry Please see my update; this is not a duplicate question, nor is it deserving of being pushed to a generic 'What is an Index?' thread. Thank you. – user3649739 May 15 '16 at 02:39
  • 2
    Please post the cardinality for each index, MySQL chooses one which is most restrictive and fits in memory. Also please post the full explain extended output. – Pentium10 May 15 '16 at 10:00
  • 1
    "T1.Name=T2.Name" - It's not an excuse for MySQL choosing a "bad" index - but it's also a bad idea to join big tables on VARCHAR columns. – Paul Spiegel May 15 '16 at 13:23
  • @PaulSpiegel So perhaps I can then find distinct values for Name, C, S and assign ID fields e.g. Name_ID, C_ID, S_ID and join on those? – user3649739 May 15 '16 at 13:56
  • @Pentium10 I'll be glad to, I am not sure how to format tables as I see people do here with the -- and the | so I'll just do as a list if that is ok. Cardinality is as far as I can tell the count for a given field or group of fields so I'll include that too. – user3649739 May 15 '16 at 13:57
  • 1
    Use File -> Paste data to paste a tab delimited data into this formatter: http://www.tablesgenerator.com/text_tables – Pentium10 May 15 '16 at 14:04
  • @Pentium10 Thanks, I've been trying to figure that out for a while. Not to digress into a Formatting 101 tutorial, but when I try to paste the output from tablesgenerator it is still a mess. For now I just did a list and will replace it once I figure out how to paste a clean table. – user3649739 May 15 '16 at 14:48
  • Currently trying the following: I made a table called Unique with ID and UniqueRecord fields. I created a unique index on that field. I then did Insert Ignore from both Table1 and Table2 using Concat of Name,C,S and am now updating a UniqueRecordID field in both Table1 and Table2 where Concat(Name,C,S) = UniqueRecord in Unique. The thought being that I can do my inner join on UniqueRecord = UniqueRecord. – user3649739 May 15 '16 at 15:21
  • Wow. Entire select between the two 5 mil record tables took 72 seconds, prior it was taking 10 seconds per record. I will update with my own answer. Thanks for everyone's help! – user3649739 May 15 '16 at 15:40
  • @Strawberry Since I assume you removed the Dupe flag and the generic answer link I wanted to not only thank you but direct you back to the final question/answer since you'll see it turned out to be helpful to more than a few people (myself included) to reassure you that the decision was the right one. – user3649739 May 16 '16 at 00:21
  • Please provide `SHOW CREATE TABLE` for each table. – Rick James May 25 '16 at 06:03

2 Answers


So based on some great comments/input I came up with a solution. One flashbulb insight, from Paul Spiegel, was that trying to join two 5+ million record tables on several VARCHAR fields was not recommended.

So what I did was create a Unique table with ID and UniqueRecord fields.

I then made the UniqueRecord a Unique Index.
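The mapping table described above might look like this (a sketch; the exact column types and index name are assumptions, since the original DDL was not posted):

```sql
-- Sketch of the mapping table; column types are assumed.
CREATE TABLE `Unique` (
    ID           INT UNSIGNED NOT NULL AUTO_INCREMENT,
    UniqueRecord VARCHAR(255) NOT NULL,    -- holds Concat(Name, C, S)
    PRIMARY KEY (ID),
    UNIQUE KEY uniq_record (UniqueRecord)
) ENGINE=InnoDB;
```

One caveat with this approach: plain CONCAT without a separator can collide (`Name='ab', C='c'` and `Name='a', C='bc'` concatenate identically), so `CONCAT_WS('|', Name, C, S)` is a safer key if the data allows it.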

I inserted into that table from both Table1 and Table2:

Insert IGNORE into `Unique` (UniqueRecord) 
Select Concat(Name,C,S) from Table1 Group by Name,C,S;
Insert IGNORE into `Unique` (UniqueRecord) 
Select Concat(Name,C,S) from Table2 Group by Name,C,S;

This gave me unique records from both within and between the two tables.

I then added a UniqueRecord_ID field to both Table1 and Table2.

I then did a join between each table and the UniqueRecord to write the UniqueRecord ID to each table:

Update Table1 as T1
Inner Join Unique as T2
On Concat(T1.Name,T1.S,T1.C) = T2.UniqueRecord
Set T1.UniqueRecord_ID=T2.ID

Finally, I added a key to each table on UniqueRecord_ID.
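Put together, the final keys and the rewritten join look roughly like this (a sketch; the index names are made up, and the range conditions on Number are carried over unchanged from the original query):

```sql
-- Index the new surrogate key on both sides (index names are hypothetical).
ALTER TABLE Table1 ADD INDEX idx_uid (UniqueRecord_ID);
ALTER TABLE Table2 ADD INDEX idx_uid (UniqueRecord_ID);

-- The three-VARCHAR equality join collapses to a single integer
-- comparison; the Number range conditions stay as they were.
SELECT T1.Z, T2.Z
FROM Table1 T1
INNER JOIN Table2 T2
    ON  T1.UniqueRecord_ID = T2.UniqueRecord_ID
    AND T1.Number >= T2.MinNumber
    AND T1.Number <= T2.MaxNumber;
```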

My Explain showed that it only used that key from T2. However, whereas the select was previously taking 10 seconds per record (I tested on 1, 10, and 100 and stopped there, as I did not have the requisite 578 days to test the whole table :| ), the entire select, returning close to 5 million records, took 72 seconds.

user3649739
  • That still sounds incredibly slow. What hardware are you using? – Strawberry May 16 '16 at 05:51
  • @Strawberry One minute to do an Update/Join between two 5 mil record tables including >=<= comparison on records matched by three fields sounds slow to you? In any event it is usable to me in one minute vs the 568 days I started at. If it helps here is the config: VM with 8x Virtual CPU 2.4Ghz 16Gb of RAM 300Gb SSD – user3649739 May 16 '16 at 12:39

Note that the first table (whichever one it is) must be fully scanned. So, the best we can do is to have a good index on the second table.

The optimal index (as already noted) for T1 is (Name,C,S,Number). For T2 it is (Name,C,S,MinNumber,MaxNumber), which is bulkier.

The optimizer seems to want to start with T1; perhaps it is slightly smaller. Let's force it to start with T2 by changing INNER JOIN to STRAIGHT_JOIN and swapping the order:

SELECT
    T1.Z, T2.Z
FROM          TableL T2   -- note
STRAIGHT_JOIN TableB T1   -- note
    on T1.Name=T2.Name 
   and T1.C=T2.C
   and T1.S=T2.S
   and T1.Number>=T2.MinNumber
   and T1.Number<=T2.MaxNumber

Then, let's do one more optimization: If Z is not 'too big', let's include it at the end of the index so that it becomes a "Covering index":

INDEX(Name,C,S,Number,Z)

(Name, C, S can be in any order, but Number, Z must be in that order and at the end.) If you currently have INDEX(Name), DROP it as being redundant.
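As a sketch, the covering index and the cleanup might look like this (the existing single-column index is assumed to be named `Name`; check `SHOW INDEX FROM TableB` for the real name before dropping anything):

```sql
-- Covering index: equality columns first, then the range column (Number),
-- then Z so the SELECT list is satisfied from the index alone.
ALTER TABLE TableB
    ADD INDEX covering (Name, C, S, Number, Z),
    DROP INDEX `Name`;   -- redundant once Name leads the new index
```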

Then the EXPLAIN will say that you are doing a full table scan of T2, plus a "Using index" on T1.

Please provide SHOW CREATE TABLE; there may be more optimizations.

Rick James
  • Thanks, I will take a look at this option since I continue to do work with these tables. Question: why must Number,Z be in that order while the rest don't matter? So far I have assumed (and changed the index to reflect) that an index is helped by putting its columns in order of cardinality, so in your suggested case it would be Name,Z,C,Number,S. Is that a baseless assumption (ordering a composite index by cardinality)? For instance, S has a cardinality of 50 while Name has a cardinality of 50,000. – user3649739 May 25 '16 at 13:51
  • Ignore cardinality. An index is handled left to right. Start with columns that are "= constant" (in any order), then optionally one range. If it is to be a "covering index", then the rest of the columns mentioned can be in any order. Caveat: That is a very short summary of good practice. [_More_](http://mysql.rjweb.org/doc.php/index_cookbook_mysql) – Rick James May 25 '16 at 20:43
  • Name, C, S are all in a `WHERE` (`ON` in this case), so any order is fine. Then Number is in a "range" (because of ">=" and "<="). I suggested tacking Z on to make it "covering". The index is structured as a BTree. It will drill down based on Name,C,S and the first value for Number. Then it will scan forward ("range") until the last Number. Z will provide the desired value for the SELECT list ("covering"). – Rick James May 25 '16 at 20:47