
I have a database with the following three tables:

matches table has 200,000 matches...

CREATE TABLE `matches` (
`match_id` bigint(20) unsigned NOT NULL,
`start_time` int(10) unsigned NOT NULL,
PRIMARY KEY (`match_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

heroes table has ~100 heroes...

CREATE TABLE `heroes` (
`hero_id` smallint(5) unsigned NOT NULL,
`name` char(40) NOT NULL,
PRIMARY KEY (`hero_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

matches_heroes table has 2,000,000 relationships (10 random heroes per match)...

CREATE TABLE `matches_heroes` (
`relation_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`match_id` bigint(20) unsigned NOT NULL,
`hero_id` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`relation_id`),
KEY `match_id` (`match_id`),
KEY `hero_id` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_2` FOREIGN KEY (`hero_id`)
REFERENCES `heroes` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_1` FOREIGN KEY (`match_id`)
REFERENCES `matches` (`match_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=3689891 DEFAULT CHARSET=utf8

The following query takes over 1 second, which seems pretty slow to me for something so simple:

SELECT SQL_NO_CACHE COUNT(*) AS match_count
FROM matches INNER JOIN matches_heroes ON matches.match_id = matches_heroes.match_id
WHERE hero_id = 5

Removing only the WHERE clause doesn't help, but if I take out the INNER JOIN also, like so:

SELECT SQL_NO_CACHE COUNT(*) AS match_count FROM matches

...it only takes 0.05 seconds. It seems that INNER JOIN is very costly. I don't have much experience with joins. Is this normal or am I doing something wrong?
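(For anyone wanting to poke at this outside MySQL: the shape of the problem can be reproduced with Python's built-in sqlite3 module. The table names mirror the question but the data is synthetic, and sqlite's `EXPLAIN QUERY PLAN` plays the role of MySQL's `EXPLAIN` — the engines differ, so treat this only as a sketch of what the planner reports.)

```python
# Sketch using Python's sqlite3 (NOT MySQL): table names from the question,
# synthetic data. EXPLAIN QUERY PLAN shows how the join is executed.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE matches (match_id INTEGER PRIMARY KEY, start_time INTEGER)")
cur.execute("CREATE TABLE matches_heroes (match_id INTEGER, hero_id INTEGER)")
cur.execute("CREATE INDEX hero_id ON matches_heroes (hero_id)")

# 200 matches, 10 heroes per match (hero_id 1..10), like a tiny version of the data
cur.executemany("INSERT INTO matches VALUES (?, ?)", [(m, 0) for m in range(1, 201)])
cur.executemany("INSERT INTO matches_heroes VALUES (?, ?)",
                [(m, h) for m in range(1, 201) for h in range(1, 11)])

join_sql = ("SELECT COUNT(*) FROM matches "
            "JOIN matches_heroes ON matches.match_id = matches_heroes.match_id "
            "WHERE hero_id = 5")

# Print the plan: seek the hero_id index, then look up each match by primary key
for row in cur.execute("EXPLAIN QUERY PLAN " + join_sql):
    print(row[-1])

count = cur.execute(join_sql).fetchone()[0]
print(count)  # 200: hero 5 appears once in each of the 200 matches
```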

UPDATE #1: Here's the EXPLAIN result.

id  select_type  table          type   possible_keys                     key     key_len  ref                                rows  Extra  
1   SIMPLE       matches_heroes ref    match_id,hero_id,match_id_hero_id hero_id 2        const                              34742
1   SIMPLE       matches        eq_ref PRIMARY                           PRIMARY 8        mydatabase.matches_heroes.match_id 1     Using index

UPDATE #2: After listening to you guys, I think it's working properly and this is simply as fast as it gets. Please let me know if you disagree. Thanks for all the help. I really appreciate it.

DaiBu
    Do you have an index on hero_id? What does EXPLAIN() tell you about that query? – Cornelius Sep 10 '14 at 11:07
    It looks like `hero_id = 5` requires a table scan. Perhaps you could add an index on that field. – Andomar Sep 10 '14 at 11:08
  • I thought it does have one. Don't the foreign keys require one? Doesn't this mean it has one? "...KEY `hero_id` (`hero_id`), CONSTRAINT `matches_heroes_ibfk_2` FOREIGN KEY (`hero_id`) REFERENCES `heroes` (`hero_id`)..." And how do I use EXPLAIN()? Sorry for my nubness. @Cornelius @Andomar – DaiBu Sep 10 '14 at 11:25
    You have foreign key without index on fields `matches_heroes.hero_id` and `matches_heroes.match_id`. I recommend you add index to `matches_heroes.hero_id` and repeat query. – user1516873 Sep 10 '14 at 11:34
  • I'm trying to add indexes using phpMyAdmin. It says they're already indexed though. In the Indexes section under Keyname it says "PRIMARY" for relation_id, "match_id" for match_id and "hero_id" for hero_id. If I go to the relation view, it says the same. Are these indexes already? I think phpMyAdmin forced me to make them when I did the foreign key stuff. @user1516873 – DaiBu Sep 10 '14 at 11:49
    @Andomar, @Cornelius, @user1516873 I've looked at his execution plan, and he has the keys there : http://sqlfiddle.com/#!2/14a8ba/1 `KEY match_id (match_id), KEY hero_id (hero_id),` in the `Create table`, [key is an alias for index](http://stackoverflow.com/questions/1401572/what-are-differences-between-index-v-s-key-in-mysql) – Stefan Rogin Sep 10 '14 at 11:52
    @clickstefan: You're right – Andomar Sep 10 '14 at 13:38
  • @Andomar so how come I get *-1* and you 3 get *+7* :) (btw thanks for the comment appreciation) – Stefan Rogin Sep 10 '14 at 13:54

2 Answers


Use COUNT(matches.match_id) instead of COUNT(*): when using joins it's best not to use *, since it can do extra computation. Counting a column from the join is the best way to ensure you aren't requesting any other operations. (Not a problem for MySQL's inner join; my bad.)

Also, verify that your keys are defragmented and that there is enough free RAM for the indexes to load into memory.

Update 1:


Try to add a composed index for match_id,hero_id as it should give better performance.

ALTER TABLE `matches_heroes` ADD KEY `match_id_hero_id` (`match_id`,`hero_id`)

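The effect of a composed index can be illustrated with Python's sqlite3 as a stand-in for InnoDB (names from the question, synthetic data; whether a given engine actually answers the count from the index alone is engine-specific, so this is only a sketch):

```python
# Sketch with Python's sqlite3 (a stand-in for MySQL/InnoDB): an index that
# contains both join columns lets the count be answered without touching the
# base table on engines that use it as a covering index.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE matches (match_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE matches_heroes (match_id INTEGER, hero_id INTEGER)")
# Same shape as: ALTER TABLE matches_heroes ADD KEY match_id_hero_id (match_id, hero_id)
cur.execute("CREATE INDEX match_id_hero_id ON matches_heroes (match_id, hero_id)")

cur.executemany("INSERT INTO matches VALUES (?)", [(m,) for m in range(1, 101)])
cur.executemany("INSERT INTO matches_heroes VALUES (?, ?)",
                [(m, h) for m in range(1, 101) for h in range(1, 11)])

join_sql = ("SELECT COUNT(*) FROM matches "
            "JOIN matches_heroes ON matches.match_id = matches_heroes.match_id "
            "WHERE hero_id = 5")

# Print the plan so you can see whether the composed index is chosen
for row in cur.execute("EXPLAIN QUERY PLAN " + join_sql):
    print(row[-1])

count = cur.execute(join_sql).fetchone()[0]
print(count)  # 100: hero 5 appears once per match
```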

Update 2:


I wasn't satisfied with the accepted answer that MySQL is simply that slow for just 2 million records, so I ran benchmarks on my Ubuntu PC (i7 processor, standard HDD).

-- pre-requirements

CREATE TABLE seq_numbers (
    number INT NOT NULL
) ENGINE = MYISAM;


DELIMITER $$
CREATE PROCEDURE InsertSeq(IN MinVal INT, IN MaxVal INT)
    BEGIN
        DECLARE i INT;
        SET i = MinVal;
        START TRANSACTION;
        WHILE i <= MaxVal DO
            INSERT INTO seq_numbers VALUES (i);
            SET i = i + 1;
        END WHILE;
        COMMIT;
    END$$
DELIMITER ;

CALL InsertSeq(1,200000)
;

ALTER TABLE seq_numbers ADD PRIMARY KEY (number)
;

--  create tables

-- DROP TABLE IF EXISTS `matches`
CREATE TABLE `matches` (
`match_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`start_time` int(10) unsigned NOT NULL,
PRIMARY KEY (`match_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;

CREATE TABLE `heroes` (
`hero_id` smallint(5) unsigned NOT NULL AUTO_INCREMENT,
`name` char(40) NOT NULL,
PRIMARY KEY (`hero_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;

CREATE TABLE `matches_heroes` (
`match_id` bigint(20) unsigned NOT NULL,
`hero_id` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`match_id`,`hero_id`),
KEY (match_id),
KEY (hero_id),
CONSTRAINT `matches_heroes_ibfk_2` FOREIGN KEY (`hero_id`) REFERENCES `heroes` (`hero_id`),
CONSTRAINT `matches_heroes_ibfk_1` FOREIGN KEY (`match_id`) REFERENCES `matches` (`match_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=MyISAM DEFAULT CHARSET=utf8
;
-- insert DATA
-- 100
INSERT INTO heroes(name)
SELECT SUBSTR(CONCAT(char(RAND()*25+65),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97),char(RAND()*25+97)),1,RAND()*9+4) as RandomName
FROM seq_numbers WHERE number <= 100

-- 200000
INSERT INTO matches(start_time)
SELECT rand()*1000000
FROM seq_numbers WHERE number <= 200000

-- 2000000
INSERT INTO matches_heroes(hero_id,match_id)
SELECT a.hero_id, b.match_id
FROM heroes as a
INNER JOIN matches as b ON 1=1
LIMIT 2000000

-- warm-up database, load INDEXes in ram (optional, works only for MyISAM tables)
LOAD INDEX INTO CACHE matches_heroes,matches,heroes


-- get random hero_id
SET @randHeroId=(SELECT hero_id FROM matches_heroes ORDER BY rand() LIMIT 1);


-- test 1 

SELECT SQL_NO_CACHE @randHeroId,COUNT(*) AS match_count
FROM matches as a 
INNER JOIN matches_heroes as b ON a.match_id = b.match_id
WHERE b.hero_id = @randHeroId
; -- Time: 0.039s


-- test 2: adding some complexity 
SET @randName = (SELECT `name` FROM heroes WHERE hero_id = @randHeroId LIMIT 1);

SELECT SQL_NO_CACHE @randName, COUNT(*) AS match_count
FROM matches as a 
INNER JOIN matches_heroes as b ON a.match_id = b.match_id
INNER JOIN heroes as c ON b.hero_id = c.hero_id
WHERE c.name = @randName
; -- Time: 0.037s

Conclusion: these results are about 20x faster than the times in the question, even though my server load was around 80% before testing (it's not a dedicated MySQL server and had other CPU-intensive tasks running). If you run the whole script above and get slower results, it may be because:

  1. you have a shared host and the load is too big. In that case there isn't much you can do: complain to your current host, pay for a better host/VM, or try another host
  2. your configured key_buffer_size (for MyISAM) or innodb_buffer_pool_size (for InnoDB) is too small; the optimum size here would be over 150 MB
  3. your available RAM is not enough; you need roughly 100-150 MB of RAM for the indexes to be loaded into memory. Solution: free up some RAM or buy more of it

Note that the test script generates fresh data, which rules out the index fragmentation problem. Hope this helps, and ask if you have trouble running it.


Observation:

SELECT SQL_NO_CACHE COUNT(*) AS match_count 
FROM matches INNER JOIN matches_heroes ON matches.match_id = matches_heroes.match_id 
WHERE hero_id = 5

is equivalent to:

SELECT SQL_NO_CACHE COUNT(*) AS match_count 
FROM matches_heroes 
WHERE hero_id = 5

So you wouldn't require a join, if that's the count you need, but I'm guessing that was just an example.
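The equivalence is easy to check with Python's sqlite3 (synthetic data): because every matches_heroes.match_id references an existing match (that's what the FOREIGN KEY guarantees), the inner join neither drops nor duplicates rows, so both counts agree.

```python
# Sketch with Python's sqlite3: the join count equals the single-table count
# as long as every matches_heroes.match_id has a matching row in matches.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE matches (match_id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE matches_heroes (match_id INTEGER, hero_id INTEGER)")
cur.executemany("INSERT INTO matches VALUES (?)", [(m,) for m in range(1, 51)])
cur.executemany("INSERT INTO matches_heroes VALUES (?, ?)",
                [(m, h) for m in range(1, 51) for h in range(1, 11)])

with_join = cur.execute(
    "SELECT COUNT(*) FROM matches "
    "JOIN matches_heroes ON matches.match_id = matches_heroes.match_id "
    "WHERE hero_id = 5").fetchone()[0]
without_join = cur.execute(
    "SELECT COUNT(*) FROM matches_heroes WHERE hero_id = 5").fetchone()[0]
print(with_join, without_join)  # 50 50 — identical results, one table touched
```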

Stefan Rogin
  • I see you struck out your first paragraph as not related to the inner-join problem. I'd like to add that it is even wrong. COUNT(\*) simply counts records, whereas COUNT(matches.match_id) counts records for which matches.match_id is not null. So `COUNT(matches.match_id)` may lead to extra work, not vice versa. Always use COUNT(\*) when possible; only use COUNT(something) when you really want to count only non-null values or distinct values (with keyword DISTINCT). – Thorsten Kettner Sep 10 '14 at 11:51
  • Can I "defragment keys" in phpMyAdmin? I'll try the ADD KEY suggestion and if it doesn't work, I'll put the EXPLAIN contents in the post up there. – DaiBu Sep 10 '14 at 11:52
  • Oh, ic. Are you saying a dual index would be better design? I wasn't sure if the relation_id is necessary. Should I take it out and use the match and hero id combination as the primary key instead? @Thorsten Kettner – DaiBu Sep 10 '14 at 11:55
  • @ThorstenKettner yes, in this case you are right, but COUNT(*) is not optimised on some SQL implementations, and I have found that using the column used in the join gives the best performance, although as you said in some cases it can alter the results – Stefan Rogin Sep 10 '14 at 11:58
  • Ah, so I added the match_id_hero_id key and it seems to run slightly faster! Down to 0.7 secs over a few runs. Is it possible that dealing with 2,000,000 relationships just takes that much time? – DaiBu Sep 10 '14 at 12:00
  • @DaiBu extra keys/indexes usually only affect your insert performance, so in your case it won't affect your query, but it's best that you only create them when you use them in selects or when you need to imply a uniqueness rule. – Stefan Rogin Sep 10 '14 at 12:02
  • @DaiBu: My comment was only a general comment on using COUNT(\*) vs. COUNT(something) and has nothing to do with your request actually. The composed index that clickstefan suggests makes your query faster btw, because it contains all columns used in the query, so the table itself doesn't have to be read anymore. – Thorsten Kettner Sep 10 '14 at 12:08
  • Ok, think Ima assume this is as fast as it gets. Thanks guys. @ThorstenKettner – DaiBu Sep 10 '14 at 12:28
  • @DaiBu have you tried the script above? I'm curious of it's outcome. – Stefan Rogin Sep 10 '14 at 21:32
  • Oh, just saw this. Awesome man, I can tell you put a lot of work into it. I don't have time right now, but I'll test it later today. So if I understand, you want me to make a fresh database and then run all the statements at once (eg. in phpMyAdmin SQL tab)? – DaiBu Sep 11 '14 at 17:22
  • Ok, so all the @rand stuff was blazingly fast but returned NULL in the first column. I changed your second query to "SELECT SQL_NO_CACHE b.hero_id, COUNT(*) AS match_count FROM matches as a INNER JOIN matches_heroes as b ON a.match_id = b.match_id INNER JOIN heroes as c ON b.hero_id = c.hero_id WHERE c.name = 'Epnsaxqj'" and it was still very fast at 0.08 seconds and returned a match count of 20000 (expected). It seems more complicated than my queries too. This is awesome, right? So do you think it's an index fragmentation problem like you mentioned? – DaiBu Sep 11 '14 at 18:11
  • Also, I see you're using MyISAM and I'm using innoDB. I should also mention that each match in my database only contains 10 heroes. Dunno how this would affect performance. – DaiBu Sep 11 '14 at 18:13
  • @DaiBu MyISAM vs InnoDB, it shouldn't matter that much, you can benchmark and recreate the tables using InnoDB just to be sure. `@rand`... variables are only available in that sql connection/session, probably you didn't execute in the same batch. As to the queries, the first is identical with your test, the second has an extra join to it. – Stefan Rogin Sep 11 '14 at 20:07
  • Now to see what the problem was, run `Optimize table matches_heroes`, `Optimize table heroes`, `Optimize table matches`, and rerun the tests on the old data, if this works then you had some index fragmentation problems, if not recreate the tables with my definition (see that I've removed the `combination_id` column) on InnoDB, if still bad performance it means that you either have InnoDB misconfigured(check the `innodb_buffer_pool_size` setting ), or somehow innoDB has more overhead for this operation – Stefan Rogin Sep 11 '14 at 20:19
    OPTIMIZE didn't seem to do anything. When I recreated the database though, it worked great! Seriously, thanks for the help man! I had already assumed it was the fastest it could be and would have had a painfully slow website that might have never been fixed. Really appreciate it. Went from a negative score to best answer. ^_^ – DaiBu Sep 12 '14 at 06:49

So you say reading a table of 200,000 records is faster than reading a table of 2,000,000 records, finding the desired ones, and then taking them all to find matching records in the 200,000-record table?

And this surprises you? It's simply a lot more work for the DBMS. (It can even happen, by the way, that the DBMS decides not to use the hero_id index when it considers a full table scan to be faster.)

So in my opinion there is nothing wrong with what is happening here.

Thorsten Kettner
  • Yes, I know it should be slower. It just seems like when they are indexed, the underlying database logic would be able to find the desired hero_ids very quickly, then in turn count the matches they are related to very quickly. I don't know how databases work, but I assume it's the way a tree or map works in programming which would be a simple matter of / 2 / 2 / 2 / 2 until you find the index that points to hero_id = 5, then count the matches that it points to (which in this case is equal to the original count of hero_id = 5 anyway). But yeah, I guess maybe it takes more time than I expected. – DaiBu Sep 10 '14 at 12:25
  • I agree, from a query optimization stand-point that's about it, all you can do now is to optimize your server configuration. For example my i7 Linux PC can do a similar join query between a 39 million and 17 million table in about 1.3 sec without a composed key, so if you have the hardware you should be able to increase that value. – Stefan Rogin Sep 10 '14 at 12:29
  • Ah, I don't have access to the machine. I'm using Bluehost VPS service. I can go into WHM and change settings if you had any configuration tips simple enough to put in a comment though. @clickstefan – DaiBu Sep 10 '14 at 12:42