Speeding up this big JOIN

Question

EDIT: there was a mistake in the following question that explains the observations. I could delete the question but this might still be useful to someone. The mistake was that the actual query running on the server was SELECT * FROM t (which was silly) when I thought it was running SELECT t.* FROM t (which makes all the difference). See tobyobrian's answer and the comments to it.

I've a too slow query in a situation with a schema as follows. Table t has data rows indexed by t_id. t adjoins tables x and y via junction tables t_x and t_y each of which contains only the foreigns keys required for the JOINs:

CREATE TABLE t (
  t_id INT NOT NULL PRIMARY KEY,
  data columns...
);
CREATE TABLE t_x (
  t_id INT NOT NULL,
  x_id INT NOT NULL,
  PRIMARY KEY (t_id, x_id),
  KEY (x_id)
);
CREATE TABLE t_y (
  t_id INT NOT NULL,
  y_id INT NOT NULL,
  PRIMARY KEY (t_id, y_id),
  KEY (y_id)
);

I need to export the stray rows in t, i.e. those not referenced in either junction table.

SELECT t.* FROM t
LEFT JOIN t_x ON t_x.t_id=t.t_id
LEFT JOIN t_y ON t_y.t_id=t.t_id
WHERE t_x.t_id IS NULL OR t_y.t_id IS NULL
INTO OUTFILE ...;

t has 21 M rows while t_x and t_y both have about 25 M rows. So this is naturally going to be a slow query.

I'm using MyISAM so I thought I'd try to speed it up by preloading the t_x and t_y indexes. The combined size of t_x.MYI and t_y.MYI was about 1.2 M bytes so I created a dedicated key buffer for them, assigned their PRIMARY keys to the dedicated buffer and LOAD INDEX INTO CACHE'ed them.

But as I watch the query in operation, mysqld is using about 1% CPU, the average system IO pending queue length is around 5, and mysqld's average seek size is in the 250 k range. Moreover, nearly all the IO is mysqld reading from t_x.MYI and t_x.MYD.

I don't understand:

Why mysqld is reading the .MYD files at all?
Why mysqld isn't using the preloaded the t_x and t_y indexes?

Could it have something to do with the t_x and t_y PRIMARY keys being over two columns?

EDIT: The query explained:

| id | select_type | table | type | possible_keys | key     | key_len | ref       | rows     | Extra       |
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+
|  1 | SIMPLE      | t     | ALL  | NULL          | NULL    | NULL    | NULL      | 20980052 |             | 
|  1 | SIMPLE      | t_x   | ref  | PRIMARY       | PRIMARY | 4       | db.t.t_id |   235849 | Using index | 
|  1 | SIMPLE      | t_y   | ref  | PRIMARY       | PRIMARY | 4       | db.t.t_id |   207947 | Using where | 
+----+-------------+-------+------+---------------+---------+---------+-----------+----------+-------------+

this is not exactly related to your question, but you say you wanted anything not mentioned in either table. this is an ambiguous statement in english, do you want something not mentioned in 0/2 tables or mentioned in 1/2 or 0/2 tables? cause your query selects the latter — tobyodavies, Nov 25 '10 at 16:26
Does it differ when you don't `SELECT t.*` but only `SELECT t.t_id`? If that's fast(er), you can later SELECT t.* from ... WHERE t.t_ID IN (select t.t_id ...))? — Konerak, Nov 25 '10 at 16:31
i've just added an answer but re-reading your question you say its reading the data file for `t_x`?? ok, i'm as stumped as you are if thats the case... — tobyodavies, Nov 25 '10 at 16:42
@tobyodavis: I added explain output to the question. Rows in t referenced in both t_x and t_y are 'good' and the rest are 'stray'. I want to select the stray rows. — , Nov 25 '10 at 16:55

score 2 · Answer 1 · answered Nov 25 '10 at 17:36

2

Use not exists - this will be the fastest - much better than 'joins' or using 'not in' in this sitution.

SELECT t.* FROM t a
Where not exists (select 1 from t_x b
                  where b.t_id = a.t_id)
or not exists (select 1 from t_y c
                where c.t_id = a.t_id);

answered Nov 25 '10 at 17:36

Roopesh Shenoy

3,389
1
33
50

of course I just realized this has nothing to do with the two questions you have asked - but hopefully modifying the query to this will speed up your performance to some extent. – Roopesh Shenoy Nov 25 '10 at 17:44
`Citation needed` ;) - can you show something to back up your claim that a `NOT EXISTS` subquery performs faster than a (ref/eq_ref) join? – Konerak Nov 25 '10 at 20:46
Exists and not exists return as soon as the first record is found in the subquery - that is sufficient to evaluate the subquery to true or false. http://stackoverflow.com/questions/2065329/sql-server-in-vs-exists-performance http://sqlinthewild.co.za/index.php/2010/03/23/left-outer-join-vs-not-exists/ . In almost all cases that I have worked with, using exists dramatically improves performance. There are few places where indexes are not properly designed that in/join might be faster (I have read this, but not yet come across any instance myself). – Roopesh Shenoy Nov 26 '10 at 05:12

tobyodavies · Accepted Answer · 2010-11-25T16:38:53.697

1

I can answer part 1 of your question, and i may or may not be able to answer part two if you post the output of EXPLAIN:

In order to select t.* it needs to look in the MYD file - only the primary key is in the index, to fetch the data columns you requested it needs the rest of the columns.

That is, your query is quite probably filtering the results very quickly, its just struggling to copy all the data you wanted.

Also note that you will probably have duplicates in your output - if one row has no refs in t_x, but 3 in x_y you will have the same t.* repeated 3 times. Given we think the where clause is sufficiently efficient, and much time is spent on reading the actual data, this is quite possibly the source of your problems. try changing to select distinct and see if that helps your efficiency

edited Nov 25 '10 at 16:38

answered Nov 25 '10 at 16:33

tobyodavies

27,347
5
42
57

Well, you pointed me at my mistake. The actual query I was running failed to prefix the `*` with `t.`!!! That at explains the reading from the data files. So I stopped the query and decided to try only exporting `t_id`. Now mysqld is consuming 100% of one processor and file system IO is just sequential reading of `t.MYD`. – Nov 25 '10 at 17:07
you will still get duplicates though without the distinct qualifier. which i doubt is what you want. – tobyodavies Nov 25 '10 at 17:13

score 0 · Answer 3 · answered Nov 25 '10 at 16:53

0

This may be a bit more efficient:

SELECT * 
FROM t
WHERE t.id NOT IN (
  SELECT DISTINCT t_id
  FROM t_x
  UNION
  SELECT DISTINCT t_id
  FROM t_y
);

answered Nov 25 '10 at 16:53

symcbean

47,736
6
59
94

Always, ALWAYS replace in and not in with exists and not exists for better performance. – Roopesh Shenoy Nov 25 '10 at 17:37
Also this is wrong - it needs to be an or not an and.. so you need two separate sub queries separated by an or. In this query, if the id is present in t_x but not in t_y, the row is not selected, but in the actual question query, it is selected. – Roopesh Shenoy Nov 26 '10 at 05:15
OP wrote "I need to export the stray rows in t, i.e. those not referenced in either junction table" which is exactly what the query I provided does. Why do you think 'exists' is more efficient than 'in' on MySQL (can you provide references?) – symcbean Nov 30 '10 at 10:22

Speeding up this big JOIN

3 Answers3