MySql - Self Join - Full Table Scan (Cannot Scan Index)

Question

I have the following self-join query:

SELECT A.id
FROM mytbl      AS A
LEFT JOIN mytbl AS B 
ON (A.lft BETWEEN B.lft AND B.rgt)

The query is quite slow, and after looking at the execution plan the cause appears to be a full table scan in the JOIN. The table has only 500 rows, and suspecting this to be the issue I increased it to 100,000 rows in order to see if it made a difference to the optimizer's selection. It did not, with 100k rows it was still doing a full table scan.

My next step was to try and force indexes with the following query, but the same situation arises, a full table scan:

SELECT A.id
FROM categories_nested_set      AS A
LEFT JOIN categories_nested_set AS B 
FORCE INDEX (idx_lft, idx_rgt)
ON (A.lft BETWEEN B.lft AND B.rgt)

All columns (id, lft, rgt) are integers, all are indexed.

Why is MySql doing a full table scan here?

How can I change my query to use indexes instead of a full table scan?

CREATE TABLE mytbl ( lft int(11) NOT NULL DEFAULT '0', 
 rgt int(11) DEFAULT NULL, 
 id int(11) DEFAULT NULL,
 category varchar(128) DEFAULT NULL,
  PRIMARY KEY (lft), 
  UNIQUE KEY id (id), 
  UNIQUE KEY rgt (rgt), 
  KEY idx_lft (lft), 
  KEY idx_rgt (rgt) ) ENGINE=InnoDB DEFAULT CHARSET=utf8

Thanks

share the results of `show create table xyz` for each relevant xyz — Drew, Jul 21 '16 at 04:40
results below: `CREATE TABLE mytbl ( lft int(11) NOT NULL DEFAULT '0', rgt int(11) DEFAULT NULL, id int(11) DEFAULT NULL, category varchar(128) DEFAULT NULL, PRIMARY KEY (lft), UNIQUE KEY id (id), UNIQUE KEY rgt (rgt), KEY idx_lft (lft), KEY idx_rgt (rgt) ) ENGINE=InnoDB DEFAULT CHARSET=utf8` — mils, Jul 21 '16 at 05:11
A `PRIMARY KEY` is a `UNIQUE` key is a `KEY`. So the two `KEYs` are redundant and should be removed. — Rick James, Jul 22 '16 at 01:40

score 2 · Answer 1 · answered Jul 21 '16 at 06:30

2

You have lot's of indexes, some of them are redundant. Let's begin by clearing up some of them. Too many indexes slows down inserts and updates.

PRIMARY KEY (lft),
KEY idx_lft (lft),

Since you already have a primary key defined on lft, there is no need what so ever for another index on lft. Similarly with a unique index on rgt there is not need for the second index listed below.

UNIQUE KEY rgt (rgt), 
KEY idx_rgt (rgt)

Now let's look at your query.

SELECT A.id
FROM mytbl      AS A
LEFT JOIN mytbl AS B 
ON (A.lft BETWEEN B.lft AND B.rgt)

This is very unlikely to be a query that will be encountered in the wild. With 500 rows, this query may produce even 5000 rows? Do you really need the entire key created in one go? The reason that this query is slow is because mysql can only optimize range comparisions for constants. It's more likely that your actually query will look something like this:

SELECT B.*
FROM mytbl      AS A
LEFT JOIN mytbl AS B 
ON (A.lft BETWEEN B.lft AND B.rgt) 
WHERE a.id = N;

Where you create the node for a particular id. This will use indexes and will be really fast. What's the point of optimizing for a query that you will not use much if at all?

answered Jul 21 '16 at 06:30

e4c5

52,766
11
101
134

thanks for the response, I have updated my question with some extra info. Basically I can't do it with a WHERE clause as it is part of a greater JOIN. I stripped it down for this question for the sake of simplicity. And in the real use-case with the JOIN it does not use an index. In the scenario of a greater JOIN, is this consideder a constant for the sake of range comparisons or does it have to be a user-defined constant? How can I avoid table scans in this scenario? Thanks – mils Jul 21 '16 at 07:11
That's moving the goal posts and moving it a long distance – e4c5 Jul 21 '16 at 07:12
I did some performance testing, the size of this mytbl makes the single biggest impact on loading my data into another system. If I have 10k rows instead of 500, performance degrades 6000%. Right now, it's taking 4 hours and it can only get worse. So I would really appreciate knowing how to make MySql use an index for a range query – mils Jul 21 '16 at 07:37
Look, all this is completely different from the question that you asked initially. Please accept this answer and let's move forward to a new question. As mentioned in my answer , range optimization happens only with constants. You post your full query (the exact query please) with it's explain output. Also post the other tables that are involved. – e4c5 Jul 21 '16 at 08:00
mils, 40 minutes after he answered the question, you made an edit. Not that I spent more than 2 minutes looking at it, but e4c5 came to the conclusion you changed the question. – Drew Jul 21 '16 at 12:58
@Drew I haven't seen that done on SO before, pretty disappointing. Thanks for the response. – mils Jul 22 '16 at 00:46
@mils, from my angle as an Answer-only guy, it is pretty disappointing to produce something, and feel like you got the rug pulled out from under you. So, from the *other side* and I empathize with the Answer side. Again, I spent no time looking at this, trust e4c5's thought to do it, and I have seen it before. You can always ask around, go to Meta, ask perhaps someone in [SOCVR](http://chat.stackoverflow.com/rooms/41570) ... it is common that we see it there. – Drew Jul 22 '16 at 00:55
@mils I'm lacking domain knowledge regarding the question at hand to tell if this is really the case, but editing questions in a way that invalidates existing answers [has always been frowned upon](http://meta.stackexchange.com/questions/43478/exit-strategies-for-chameleon-questions). Be careful not to turn your question into a chameleon question. – Andras Deak -- Слава Україні Jul 22 '16 at 01:22
1

Thank you @AndrasDeak I have been burnt many times by chameleon questions. This is the first time I have actually rolled back an edit for that reason but it will not be the last. – e4c5 Jul 22 '16 at 01:47
Thank you @drew you have understood my actions perfectly. – e4c5 Jul 22 '16 at 01:47

score -1 · Accepted Answer · edited Aug 09 '16 at 15:37

-1

The following SO question is critical to the solution, as there is very little information on the combination of adjacency lists and indices:

MySQL & nested set: slow JOIN (not using index)

It appears that adding a basic comparison condition triggers the use of an index, like so:

SELECT A.id
FROM mytbl      AS A
LEFT JOIN mytbl AS B ON (A.lft BETWEEN B.lft AND B.rgt)
-- THE FOLLOWING DUMMY CONDITIONS TRIGGER INDEX
WHERE A.lft > 0
AND B.lft > 0
AND B.rgt > 0

And no more table scans.

EDIT: Comparison of EXPLAIN function between fixed and unfixed version of the query:

edited Aug 09 '16 at 15:37

Robert Harvey

178,213
47
333
501

answered Jul 22 '16 at 00:45

mils

1,878
2
21
42

Please test the following with and without that 'fix': `FLUSH STATUS; SELECT ...; SHOW SESSION STATUS LIKE 'Handler%';` If the number are the same between them, then there is still a 'full scan', but perhaps in the index instead of in the table. – Rick James Jul 22 '16 at 02:29
Thanks Rick, numbers below (zero numbers excluded): WITH FIX 'Handler_commit','1' 'Handler_external_lock','4' 'Handler_read_first','2' 'Handler_read_key','2' 'Handler_read_next','646' WITHOUT FIX 'Handler_commit','1' 'Handler_external_lock','4' 'Handler_read_first','72' 'Handler_read_key','72' 'Handler_read_rnd_next','37941' – mils Jul 22 '16 at 03:41
That convinces me that the fix helped. – Rick James Jul 22 '16 at 03:43
Perhaps `EXPLAIN SELECT ...` would point out what it decided to do differently. – Rick James Jul 22 '16 at 03:44
I have added a screenshot with results above – mils Jul 22 '16 at 08:00

MySql - Self Join - Full Table Scan (Cannot Scan Index)

2 Answers2

Linked