The first approach, with joins, is by far faster. In second the query will be executed for each row. Some databases optimize nested queries into joins though.
Join vs. sub-query
Article MySQL performance: INNER JOIN vs. sub-select
I found using a 'virtual table' instead of a ROW subquery is much faster on my table. It seems that the row subquery isn't optimized, where the join over the 'virtual table' is optimized.
Below are the queries and the 'EXPLAIN' returned for educational purposes.
-- Query using ROW subquery
EXPLAIN
SELECT
*
FROM
region
WHERE
ROW (PDB,CHAIN) IN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
)
LIMIT
10
;
+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+
| 1 | PRIMARY | region | ALL | NULL | NULL | NULL | NULL | 57362 | Using where |
| 2 | DEPENDENT SUBQUERY | region | ALL | NULL | NULL | NULL | NULL | 57362 | Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | split_domain | eq_ref | PRIMARY | PRIMARY | 3 | scop_1_65.region.SUNID | 1 | Using where; Using index; Not exists |
+----+--------------------+--------------+--------+---------------+---------+---------+------------------------+-------+--------------------------------------+
3 rows in set (0.04 sec)
I can't get any results out of the above (takes too long) - Perhaps the limit clause isn't kicking in?
-- Query using joined virtual table
EXPLAIN
SELECT
*
FROM
region
INNER JOIN (
SELECT
region.PDB,
region.CHAIN
FROM
region LEFT JOIN split_domain USING (SUNID)
WHERE
split_domain.SUNID IS NULL
GROUP BY
PDB, CHAIN
HAVING
COUNT(*)>1
) AS x
ON
region.PDB = x.PDB
AND
region.CHAIN = x.CHAIN
LIMIT
10
;
+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 8624 | |
| 1 | PRIMARY | region | ref | PDB,CHAIN,pdb_chain | pdb_chain | 5 | x.PDB,x.CHAIN | 1 | |
| 2 | DERIVED | region | ALL | NULL | NULL | NULL | NULL | 57362 | Using temporary; Using filesort |
| 2 | DERIVED | split_domain | eq_ref | PRIMARY | PRIMARY | 3 | scop_1_65.region.SUNID | 1 | Using where; Using index; Not exists |
+----+-------------+--------------+--------+---------------------+-----------+---------+------------------------+-------+--------------------------------------+
4 rows in set (1.02 sec)
The above returns ...
10 results in about 1 second
100 results in about 1 second
1000 results in about 1.5 seconds
Full set (20437) in about 2 seconds
The former query doesn't return in 5 mins (even with limit 10).
I hope this is useful to anyone designing (or trying to optimize) complex subqueries, and that the precise details of the data are not necessary to convey the results presented here.