
I have two tables, `a` and `b`, which are defined as below:

CREATE TABLE `a`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `cid` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `data` varchar(1024) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `update_time` timestamp(0) NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP(0),
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 150 CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;
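
Table `b` is defined identically, apart from the table name (per the comments below, both tables share the same definition). A reconstruction of its DDL (the AUTO_INCREMENT counter of `b` is not given in the question, so it is omitted here):

CREATE TABLE `b`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `cid` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `data` varchar(1024) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL DEFAULT '',
  `update_time` timestamp(0) NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP(0),
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Compact;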

There are 10k rows in `a` and 200k rows in `b`; all the data was generated by a random function. Below is a sample row:

id    cid         data        update_time
151   8VE6BU06    8VE6BU06    2019-09-12 23:07:39

Here are three tests:

1. Case 1 takes 2.889s

SELECT cid FROM `a` WHERE a.cid NOT IN (SELECT b.cid FROM b);

Execution plan:

id  select_type  table  type  rows    Extra
1   PRIMARY      a      ALL   10094   Using where
2   SUBQUERY     b      ALL   199826

2. Case 2 takes 628.699s

DELETE FROM `a` WHERE a.cid NOT IN (SELECT b.cid FROM b);

Execution plan:

id  select_type         table  type  rows    Extra
1   PRIMARY             a      ALL   10094   Using where
2   DEPENDENT SUBQUERY  b      ALL   199826  Using where

3. Case 3 takes 0.036s

ALTER TABLE b ADD INDEX cid(cid);
DELETE FROM `a` WHERE a.cid NOT IN (SELECT b.cid FROM b);

Execution plan:

id  select_type         table  type            possible_keys  key  key_len  ref   rows    Extra
1   PRIMARY             a      ALL                                                10094   Using where
2   DEPENDENT SUBQUERY  b      index_subquery  cid            cid  302      func  1       Using index

Questions

  1. Why does test 2 take so long?
  2. Why does test 1 use SUBQUERY while test 2 uses DEPENDENT SUBQUERY? Why does test 1 execute so much faster than test 2?
  3. Why is test 3 so much faster than test 2?

MySQL version: 5.6.20 (InnoDB)

A row in table `b` looks like:

id    cid         data        update_time
151   8VE6BU06    8VE6BU06    2019-09-12 23:07:39
Wallace
  • The 2nd query modifies data, which takes longer. – Shadow Sep 12 '19 at 15:37
  • Have a look at the execution plans – kjmerf Sep 12 '19 at 15:57
  • One involves the hard drive, the other doesn't and uses RAM. You just measured the speed of your I/O subsystem with these. There's not much to it really. – N.B. Sep 12 '19 at 16:46
  • Out of curiosity, does test2 still show as a DEPENDENT SUBQUERY if you change the subquery's `cid` to `b.cid` ? If so, it may be a bug in the query optimizer. – Powerlord Sep 15 '19 at 05:50
    Yes, same result after changing from cid to b.cid. The problem can easily be reproduced by creating two tables (of the same definition excluding the table name) and generating some random data. @Powerlord – Wallace Sep 15 '19 at 06:22
  • Please don't cross-post questions, see [Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site?](//meta.stackexchange.com/q/64068). The question is not off-topic here, so I'm not going to close and delete this copy, especially since there are now answers posted here. Please remove the copy on DBA. – Martijn Pieters Sep 16 '19 at 20:14
  • To all of you who claim "It's because of DELETE": have a look at case #3. It is deleting the same rows as in #2 plus index records, but is still much faster than #1 and #2. So it's not as simple as that. – Paul Spiegel Sep 21 '19 at 17:07
  • @PaulSpiegel, Case #3 was added in an edit to the OP. However, it still is predictable: In Case #3, the `DELETE` no longer has to perform a linear search of `table b` to find any foreign key references to the `table a` records being deleted because an index is now available. – daShier Sep 22 '19 at 15:47
  • @daShier And why are you telling me this? – Paul Spiegel Sep 23 '19 at 09:07

3 Answers

  • 5.5 parses and optimizes DELETE and UPDATE differently from (and not as well as) SELECT. A later version (5.7? or 8.0) does better.
  • DELETE must create undo records in case of a crash midway through the DELETE; this is quite costly relative to just selecting.
  • Notice how the first two EXPLAINs say ALL and ALL. This implies that for each row in one table, it scans all the rows of the other table. The third EXPLAIN is orders of magnitude faster because of the index.
  • In older versions, NOT IN ( SELECT ... ) usually performed very poorly -- as in complete scans, etc.
  • Consider changing to a "multi-table DELETE" using a LEFT JOIN ... IS NULL instead of the IN approach (see the sketch after this list).
  • If you are timing a 'cold' system, then there is a lot of I/O.
  • If innodb_buffer_pool_size is too small to hold all of b, this would lead to a lot of I/O, hence the long time for query 2.
  • Not knowing the average size of cid and data, it is not practical to discuss the I/O needs further.
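
A minimal sketch of that rewrite, using the tables from the question (since both cid columns are declared NOT NULL here, the anti-join deletes exactly the same rows as the NOT IN version):

-- Multi-table DELETE: rows of a with no matching cid in b
-- come out of the LEFT JOIN with b.cid IS NULL.
DELETE a
FROM a
LEFT JOIN b ON b.cid = a.cid
WHERE b.cid IS NULL;

With the index from case 3 (ALTER TABLE b ADD INDEX cid(cid)), the join probes the index once per row of a instead of scanning all of b.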
Rick James
  • " first two EXPLAINs say ALL and ALL" -- so does mysql really perform 10k * 200k times of scan in test1? – Wallace Sep 16 '19 at 01:48
  • Mysql version is 5.6.20, and average row data of table b is about 30bytes only. – Wallace Sep 16 '19 at 01:50
  • @Steve 10K * 200K rows are looked at. 30 bytes -- That table is likely to become cached and stay cached since it is only a few MB. – Rick James Sep 16 '19 at 04:54
  • Rick, do you mean test 1 actually scans table b 10,000 times? And test 1 runs so fast only because MySQL may have cached the whole of table b? – Wallace Sep 18 '19 at 01:30
  • @Steve - Yes. That's how to read the first two EXPLAINs. In general (not always the case), multiply the numbers in the "Rows" column to get a very crude metric for how long the query will take. – Rick James Sep 18 '19 at 06:02
  • @Steve - When the blocks are cached in the buffer_pool they don't need to be read from disk. But there is still effort to look at the row(s) in the block(s). – Rick James Sep 18 '19 at 06:03

Are these truly measured in seconds? That seems very excessive for such simple queries on moderately sized tables.

That said, in general a DELETE will take longer than a SELECT over the same data. The database engine needs to ensure that no foreign key references are broken, update indexes, and so on. This behavior seems normal.
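
As an illustration (hypothetical: the tables in the question define no foreign keys, so this child table is invented purely for the example), a referencing table would force every DELETE on `a` to check for child rows:

-- Hypothetical child table; NOT part of the question's schema.
CREATE TABLE a_child (
  id int(11) NOT NULL AUTO_INCREMENT,
  a_id int(11) NOT NULL,
  PRIMARY KEY (id),
  CONSTRAINT fk_a_child FOREIGN KEY (a_id) REFERENCES a (id)
) ENGINE = InnoDB;

-- Each row deleted from a now also costs a lookup in a_child
-- (InnoDB automatically indexes the foreign key column a_id).
DELETE FROM a WHERE id = 151;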

daShier
  1. Reads are faster than writes/deletes, as @Shadow commented.

  2. Refer to https://dev.mysql.com/doc/refman/5.5/en/subquery-optimization.html

  3. Indexing

Indexing will give you great performance.

To illustrate in a simple way: an index is like page 0 of a document.

Page 0 (index)
  MySQL        p. 1
  Oracle       p. 2
  ...
  PostgreSQL   p. 100

Page 1
  Some looooooong MySQL material

Page 2
  Some looooooong Oracle material

...

Page 100
  Some looooooong PostgreSQL material

Read with index

It takes a relatively short time to find a key within page 0. For example, after finding PostgreSQL in the index you go straight to page 100, which is much faster than reading hundreds of pages.

Delete with index

To delete PostgreSQL, you can remove just its entry from the page 0 index; you don't have to erase all the content on page 100.
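
Applied to the question, this is what case 3 does: with an index on b.cid, MySQL can probe the index once per row of a instead of scanning all of b. A sketch of how to confirm it (EXPLAIN works for DELETE as of MySQL 5.6):

ALTER TABLE b ADD INDEX cid(cid);

-- The plan for b should now show type = index_subquery and "Using index"
-- instead of a full scan (type = ALL), as in the question's third plan.
EXPLAIN DELETE FROM a WHERE a.cid NOT IN (SELECT b.cid FROM b);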

For more information, please refer to "How does database indexing work?"

John