I need a piece of advice, building an app now and I need to run some queries on rather large tables, possibly at a very frequent rate, so I'm trying to get the best approach performance wise.
I have the following 2 tables:
Albums:
+---------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| eventid | int(11) | NO | MUL | NULL | |
| album | varchar(200) | NO | | NULL | |
| filename | varchar(200) | NO | | NULL | |
| obstacle_time | time | NO | | NULL | |
+---------------+--------------+------+-----+---------+----------------+
and keywords:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| eventid | int(11) | NO | MUL | NULL | |
| filename | varchar(200) | NO | | NULL | |
| bibnumbers | varchar(200) | NO | | NULL | |
| gender | varchar(20) | YES | | NULL | |
| top_style | varchar(20) | YES | | NULL | |
| pants_style | varchar(20) | YES | | NULL | |
| other | varchar(20) | YES | | NULL | |
| cap | varchar(200) | NO | | NULL | |
| tshirt | varchar(200) | NO | | NULL | |
| pants | varchar(200) | NO | | NULL | |
+-------------+--------------+------+-----+---------+----------------+
Both table have a unique_index declared which is a constraint of the eventid+filename
column.
Both table contains information about some images, but the albums table is available instantly (as soon as I have the images), while the keywords table usually becomes available several days later after a manual tagging of the images is completed
Now I will have people searching for all kind of things once the tagging is enabled, but since the results can be HUGE (up to 10.000 or more) I'm only showing them in small chunks so the browser doesn't get killed with trying to load a huge amount of images, because of this my server will be hit with loads of query requests (every time the visitor scrolls to the bottom of the page, an ajax query will return the next chunk of images).
Now my question is, which of the following queries is better performance wise:
SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time`
FROM `albums`
LEFT JOIN `basket`
ON `basket`.`eventid` = `albums`.`eventid`
AND `basket`.`fileid` = `albums`.`id`
AND `basket`.`visitor_id` = 1
LEFT JOIN `keywords`
ON `keywords`.`eventid` = `albums`.`eventid`
AND `albums`.`filename` = `keywords`.`filename`
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%string%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `gender` = 1
AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]'
AND `cap` = '2'
AND `tshirt` = '1'
AND `pants` = '3'
ORDER BY `obstacle_time`
LIMIT X, 10
OR using an IN
CLAUSE inside WHERE
like:
SELECT `albums`.`filename`,`basket`.`id`,`albums`.`id`,`obstacle_time`
FROM `albums`
LEFT JOIN `basket`
ON `basket`.`eventid` = `albums`.`eventid`
AND `basket`.`fileid` = `albums`.`id`
AND `basket`.`visitor_id` = 1
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%string%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `filename` IN (
SELECT `filename`
FROM `keywrods`
WHERE
`eventid` = 1
AND `gender` = 1
AND `top_style` REGEXP '[[:<:]]0[[:>:]]|[[:<:]]1[[:>:]]'
AND `cap` = '2'
AND `tshirt` = '1'
AND `pants` = '3'
)
ORDER BY `obstacle_time`
LIMIT X, 10
I had looked to similar questions but wasn't able to figure it out which is the best course of action.
My understanding so far is that:
Using
LEFT JOIN
takes advantages of INDEXING, BUT!!! if I use it I will get a full join of the tables even when I only need a significantly smaller result set, so it's almost a wast to join thousands of rows just to then filter out most of them.Using IN and subquery isn't indexed??? I'm not 100% sure about this, I'm using MySQL 5.6 and to the best of my understanding since 5.6 even subqueries get automatically indexed my MySQL. I think this method has benefits when there result is significantly filtered, not sure if there will be any benefit if the subquery will return all the possible filenames.
As footnote questions:
Should I consider returning the whole result to the client on the first query and use client side (HTML) techniques to load the images gradually rather than re-querying the server each time?
Should I consider merging the 2 tables into 1, how much of a performance impact will that have? (can be tricky due to various reasons, which have no place in the question)
Thanks.
EDIT 1
Explain for JOIN query:
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
| 1 | SIMPLE | albums_2015 | ref | unique_index | unique_index | 4 | const | 6475 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | basket | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using join buffer (Block Nested Loop) |
| 1 | SIMPLE | keywords_2015 | eq_ref | unique_index | unique_index | 206 | const,mybibnumber.albums_2015.filename | 1 | Using index |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+
Using WHERE IN:
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
| 1 | SIMPLE | albums_2015 | ref | unique_index | unique_index | 4 | const | 6475 | Using where; Using temporary; Using filesort | |
| 1 | SIMPLE | keywords_2015 | eq_ref | unique_index | unique_index | 206 | const,mybibnumber.albums_2015.filename | 1 | Using where | |
| 1 | SIMPLE | basket | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using join buffer (Block Nested Loop) | |
+----+-------------+---------------+--------+---------------+--------------+---------+----------------------------------------+------+----------------------------------------------------+--+
EDIT 2
I wasn't able to set up a SQL Fiddler (keep getting error of something went wrong), so I have created a test database on one of my servers.
Address: http://188.165.217.185/phpmyadmin/, user: temp_test, pass: test_temp
I'm still building the whole thing and I don't have all the values filled in yet, like top_style, pants_style, etc, so a more appropriate query for the test scenario will be:
WHERE IN:
SELECT `albums_2015`.`filename`,
`albums_2015`.`id`,
`obstacle_time`
FROM `albums_2015`
WHERE `albums_2015`.`eventid` = 1
AND `album` LIKE '%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `filename` IN (SELECT `filename`
FROM `keywords_2015`
WHERE eventid = 1
AND
`bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]')
ORDER BY `obstacle_time`
LIMIT 0, 10
LEFT JOIN
SELECT `albums_2015`.`filename`,`albums_2015`.`id`,`obstacle_time`
FROM `albums_2015`
LEFT JOIN `keywords_2015`
ON `keywords_2015`.`eventid` = `albums_2015`.`eventid`
AND `albums_2015`.`filename` = `keywords_2015`.`filename`
WHERE
`albums_2015`.`eventid` = 1
AND `album` LIKE '%'
AND `obstacle_time` >= '08:00:00'
AND `obstacle_time` <= '14:11:10'
AND `bibnumbers` REGEXP '[[:<:]]113[[:>:]]|[[:<:]]106[[:>:]]'
ORDER BY `obstacle_time`
LIMIT 0, 10