Removing all duplicates except one - optimized queries

Question

I have tried the following two queries:

delete from app where not exists
(select a2.app_package, max(a2.id) from (select * from app) as a2
 where a2.app_package = app.app_package having max(a2.id) = app.id);

AND

DELETE FROM app
USING app, 
(select app_package, max(id) as ID from app
group by app_package
) as A 
where A.ID > app.ID AND
A.app_package = app.app_package;

and am really stuck as to which one would execute faster.

SQLFiddles:

Both execution plans are the same:

ID  SELECT_TYPE TABLE   TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1    SIMPLE    app    ALL                                        7    100

Are there further optimizations that could be made?

What exactly is the question -- they are both the same execution plan there won't be a difference in timing. — Hogan, Nov 21 '13 at 18:57
@Hogan So, same execution plan means they are exactly the same? Or are there differences? One is using a JOIN while the other uses exists. — Menelaos, Nov 21 '13 at 19:00
Some `JOIN`s can be expressed as a `EXIST` clause and any `EXIST` clause can be expressed as a `JOIN` -- so logically there is no difference, when compiled by the server they result in the same exact operations — Hogan, Nov 21 '13 at 19:01
@Hogan Would changing my query to be like this http://stackoverflow.com/a/854177/1688441 be faster? In general I'm wondering if there is something faster because we have a table with almost a million records. — Menelaos, Nov 21 '13 at 19:15

score 2 · Accepted Answer · answered Nov 21 '13 at 19:59

The execution plan you are showing, is not that of the DELETE query but that from the SELECT * FROM app query, which just does a full table scan (as expected as you aren't filtering on anything).

To see the execution plan, you will need to run the explain on the delete statements instead (appearantly not possible in sqlfiddle).

I took the liberty of assuming the you have an index on app_package. If you don't, you should definitely add it.

The first example (simply replace DELETE FROM with SELECT * FROM) shows that you are doing full table scans (bad) and using a DEPENDENT subquery which will be ran for almost every record in the outer table (which is bad as well).

1   PRIMARY app ALL 7   Using where
2   DEPENDENT SUBQUERY  <derived3>  ALL 7   Using where
3   DERIVED app ALL 7

To see that of the second one, you will have to translate the delete into a SELECT statement, something like this

SELECT * FROM app, (
SELECT app_package, MAX( id ) AS ID
FROM app
GROUP BY app_package
) AS A
WHERE A.ID > app.ID
AND A.app_package = app.app_package

which gives

1   PRIMARY <derived2>  ALL 4   
1   PRIMARY app ref 1   Using where
2   DERIVED app index   7

As you can see, this is one isn't using dependant subqueries and not doing full table scans. This will definitely run faster when the amount of data in the table grows.

Also not that a query plan is not represenative if the test data you are executing it agains is not simular than the real data. For instance, if an index doesn't have a lot of cardinality, the optimizer might choose not to use it and scan the table instead. So make sure you are testing it against a good dataset. — Tom Cannaerts, Nov 21 '13 at 20:04

Removing all duplicates except one - optimized queries

Are there further optimizations that could be made?

1 Answers1