164

I've got a couple of duplicates in a database that I want to inspect, so what I did to see which are duplicates, I did this:

SELECT relevant_field
FROM some_table
GROUP BY relevant_field
HAVING COUNT(*) > 1

This way, I will get all rows with relevant_field occuring more than once. This query takes milliseconds to execute.

Now, I wanted to inspect each of the duplicates, so I thought I could SELECT each row in some_table with a relevant_field in the above query, so I did like this:

SELECT *
FROM some_table 
WHERE relevant_field IN
(
    SELECT relevant_field
    FROM some_table
    GROUP BY relevant_field
    HAVING COUNT(*) > 1
)

This turns out to be extreeeemely slow for some reason (it takes minutes). What exactly is going on here to make it that slow? relevant_field is indexed.

Eventually I tried creating a view "temp_view" from the first query (SELECT relevant_field FROM some_table GROUP BY relevant_field HAVING COUNT(*) > 1), and then making my second query like this instead:

SELECT *
FROM some_table
WHERE relevant_field IN
(
    SELECT relevant_field
    FROM temp_view
)

And that works just fine. MySQL does this in some milliseconds.

Any SQL experts here who can explain what's going on?

Gabriele Petrioli
  • 191,379
  • 34
  • 261
  • 317
quano
  • 18,812
  • 25
  • 97
  • 108
  • what u exact want? want to delete Duplicate entries except one?? Suggestion: please Read [Self Join](http://www.devshed.com/c/a/MySQL/MySQL-Table-Joins/4/) – xkeshav May 26 '11 at 07:57
  • 1
    obviously is the group-by that is slow ... – ajreal May 26 '11 at 08:01
  • The first query executes in milliseconds (the one grouping and filtering with HAVING). It's only in combination with the other query which makes everything slow (it takes minutes). – quano May 26 '11 at 08:03
  • @diEcho, I want to find duplicates, inspect them, and delete some manually. – quano May 26 '11 at 08:06

11 Answers11

152

The subquery is being run for each row because it is a correlated query. One can make a correlated query into a non-correlated query by selecting everything from the subquery, like so:

SELECT * FROM
(
    SELECT relevant_field
    FROM some_table
    GROUP BY relevant_field
    HAVING COUNT(*) > 1
) AS subquery

The final query would look like this:

SELECT *
FROM some_table
WHERE relevant_field IN
(
    SELECT * FROM
    (
        SELECT relevant_field
        FROM some_table
        GROUP BY relevant_field
        HAVING COUNT(*) > 1
    ) AS subquery
)
quano
  • 18,812
  • 25
  • 97
  • 108
  • 4
    This worked amazingly well for me. I had another IN(subquery) within a IN(subquery), and it was taking more than 10 minutes, so long that I googled while I was waiting. Wrapping each subquery in SELECT * FROM () as you suggested reduced it to 2 seconds! – Liam Oct 08 '15 at 16:11
  • THANK YOU, I've been trying to figure out a good way to do this for a couple hours now. This worked perfectly. Wish I could give you more upvotes! This should definitely be the answer. – thaspius Jan 27 '16 at 18:25
  • Works perfectly. A query that took ~50secs to run is now instantaneous. Wish I could upvote more. Sometimes you cannot use joins so this is the right answer. – simon Apr 11 '16 at 09:22
  • I wonder why the optimizer considers queries with unions correlated... Anyway, this trick worked like magic – Brian Leishman Oct 26 '17 at 16:00
  • 10
    Could you please explain what makes that a correlated subquery? My understanding that subquery becomes correlated, when it uses a value that depends on the outer query. But in this example I cannot see any interdependencies. It'd give the same result for each row returned by the outer query. I have a similar example being implemented on MariaDB and I can see no performance hit (so far), so I'd like to see clearly, when this `SELECT *` wrapping is needed. – Bence Szalai Sep 25 '19 at 11:43
119

Rewrite the query into this

SELECT st1.*, st2.relevant_field FROM sometable st1
INNER JOIN sometable st2 ON (st1.relevant_field = st2.relevant_field)
GROUP BY st1.id  /* list a unique sometable field here*/
HAVING COUNT(*) > 1

I think st2.relevant_field must be in the select, because otherwise the having clause will give an error, but I'm not 100% sure

Never use IN with a subquery; this is notoriously slow.
Only ever use IN with a fixed list of values.

More tips

  1. If you want to make queries faster, don't do a SELECT * only select the fields that you really need.
  2. Make sure you have an index on relevant_field to speed up the equi-join.
  3. Make sure to group by on the primary key.
  4. If you are on InnoDB and you only select indexed fields (and things are not too complex) than MySQL will resolve your query using only the indexes, speeding things way up.

General solution for 90% of your IN (select queries

Use this code

SELECT * FROM sometable a WHERE EXISTS (
  SELECT 1 FROM sometable b
  WHERE a.relevant_field = b.relevant_field
  GROUP BY b.relevant_field
  HAVING count(*) > 1) 
Johan
  • 74,508
  • 24
  • 191
  • 319
  • 1
    You can also write that with `HAVING COUNT(*) > 1`. It's usually faster in MySQL. – ypercubeᵀᴹ May 26 '11 at 09:33
  • @ypercube, done for the bottom query, I think that for the top query it will alter the outcome. – Johan May 26 '11 at 09:37
  • @Johan: Since `st2.relevant_field` is not `NULL` (it's already included in the `ON` clause), it won't alter the result. – ypercubeᵀᴹ May 26 '11 at 09:41
  • @ypercube, so you can change count(afield) into count(*) **if** you're sure `afield` will never be `null`, got it. Thanks – Johan May 26 '11 at 09:46
  • Does this actually list all duplicates? Not just one of each duplicate. – quano May 26 '11 at 10:32
  • 1
    @quano, yes it lists **all** duplicates because the `group by` is on `st1.id`, not on `st1.relevant_field`. – Johan May 26 '11 at 10:44
  • Hi, any of you guys want to attempt to help me solve this: http://dba.stackexchange.com/questions/143383/user-variable-in-nested-subquery-not-getting-picked-up-in-complex-sort-of-2-ta ? – insaner Jul 09 '16 at 22:51
  • slow too. 47 seconds on query. – neobie Aug 15 '16 at 16:47
9
SELECT st1.*
FROM some_table st1
inner join 
(
    SELECT relevant_field
    FROM some_table
    GROUP BY relevant_field
    HAVING COUNT(*) > 1
)st2 on st2.relevant_field = st1.relevant_field;

I've tried your query on one of my databases, and also tried it rewritten as a join to a sub-query.

This worked a lot faster, try it!

ceteras
  • 3,360
  • 2
  • 19
  • 14
  • Yes, this will probably create a temp table with the group results, so it will be same speed as the view version. But the query plans should tell the truth. – ypercubeᵀᴹ May 26 '11 at 09:45
6

I have reformatted your slow sql query with www.prettysql.net

SELECT *
FROM some_table
WHERE
 relevant_field in
 (
  SELECT relevant_field
  FROM some_table
  GROUP BY relevant_field
  HAVING COUNT ( * ) > 1
 );

When using a table in both the query and the subquery, you should always alias both, like this:

SELECT *
FROM some_table as t1
WHERE
 t1.relevant_field in
 (
  SELECT t2.relevant_field
  FROM some_table as t2
  GROUP BY t2.relevant_field
  HAVING COUNT ( t2.relevant_field ) > 1
 );

Does that help?

plang
  • 5,446
  • 3
  • 25
  • 36
  • 1
    It does unfortunately not help. It executes just as slow. – quano May 26 '11 at 08:14
  • I have updated my answer, can you try again? Even if the group by is slow, it should be executed only once... – plang May 26 '11 at 08:23
  • I accidentally killed a live mysql server last time, so I'm afraid I can't try this right now. I'll have to set up a test database later. But I don't understand why this should affect the query. The HAVING statement should only apply to the query which it's within, shouldn't it? I really don't understand why the "real" query should affect the subquery. – quano May 26 '11 at 08:26
  • I found this: http://www.xaprb.com/blog/2006/04/30/how-to-optimize-subqueries-and-joins-in-mysql/ . I think this might be the solution. Will try when I get time. – quano May 26 '11 at 08:36
6

Subqueries vs joins

http://www.scribd.com/doc/2546837/New-Subquery-Optimizations-In-MySQL-6

Community
  • 1
  • 1
edze
  • 2,965
  • 1
  • 23
  • 29
  • I suspected something like this, that the subquery is being run for each row. – quano May 26 '11 at 08:56
  • Some MySQL Version even don't use an Index in IN. I've added another link. – edze May 26 '11 at 09:06
  • 1
    MySQL 6 is not stable yet, I would not recommend that for production! – Johan May 26 '11 at 09:38
  • 1
    I would not recommend it. But here is explained how it runs internally (4.1/5.x -> 6). This demonstrates some pitfalls of the current versions. – edze May 26 '11 at 11:19
4

Firstly you can find duplicate rows and find count of rows is used how many times and order it by number like this;

SELECT q.id,q.name,q.password,q.NID,(select count(*) from UserInfo k where k.NID= q.NID) as Count,
(
  CASE q.NID
  WHEN @curCode THEN
   @curRow := @curRow + 1
  ELSE
   @curRow := 1
  AND @curCode := q.NID
  END
 ) AS No
FROM UserInfo q,
(
  SELECT
   @curRow := 1,
   @curCode := ''
 ) rt
WHERE q.NID IN
(
    SELECT NID
    FROM UserInfo
    GROUP BY NID
    HAVING COUNT(*) > 1
) 

after that create a table and insert result to it.

create table CopyTable 
SELECT q.id,q.name,q.password,q.NID,(select count(*) from UserInfo k where k.NID= q.NID) as Count,
(
  CASE q.NID
  WHEN @curCode THEN
   @curRow := @curRow + 1
  ELSE
   @curRow := 1
  AND @curCode := q.NID
  END
 ) AS No
FROM UserInfo q,
(
  SELECT
   @curRow := 1,
   @curCode := ''
 ) rt
WHERE q.NID IN
(
    SELECT NID
    FROM UserInfo
    GROUP BY NID
    HAVING COUNT(*) > 1
) 

Finally, delete dublicate rows.No is start 0. Except fist number of each group delete all dublicate rows.

delete from  CopyTable where No!= 0;
harun ugur
  • 1,718
  • 18
  • 18
3

Try this

SELECT t1.*
FROM 
 some_table t1,
  (SELECT relevant_field
  FROM some_table
  GROUP BY relevant_field
  HAVING COUNT (*) > 1) t2
WHERE
 t1.relevant_field = t2.relevant_field;
NoNaMe
  • 6,020
  • 30
  • 82
  • 110
1

sometimes when data grow bigger mysql WHERE IN's could be pretty slow because of query optimization. Try using STRAIGHT_JOIN to tell mysql to execute query as is, e.g.

SELECT STRAIGHT_JOIN table.field FROM table WHERE table.id IN (...)

but beware: in most cases mysql optimizer works pretty well, so I would recommend to use it only when you have this kind of problem

0

This is similar to my case, where I have a table named tabel_buku_besar. What I need are

  1. Looking for record that have account_code='101.100' in tabel_buku_besar which have companyarea='20000' and also have IDR as currency

  2. I need to get all record from tabel_buku_besar which have account_code same as step 1 but have transaction_number in step 1 result

while using select ... from...where....transaction_number in (select transaction_number from ....), my query running extremely slow and sometimes causing request time out or make my application not responding...

I try this combination and the result...not bad...

`select DATE_FORMAT(L.TANGGAL_INPUT,'%d-%m-%y') AS TANGGAL,
      L.TRANSACTION_NUMBER AS VOUCHER,
      L.ACCOUNT_CODE,
      C.DESCRIPTION,
      L.DEBET,
      L.KREDIT 
 from (select * from tabel_buku_besar A
                where A.COMPANYAREA='$COMPANYAREA'
                      AND A.CURRENCY='$Currency'
                      AND A.ACCOUNT_CODE!='$ACCOUNT'
                      AND (A.TANGGAL_INPUT BETWEEN STR_TO_DATE('$StartDate','%d/%m/%Y') AND STR_TO_DATE('$EndDate','%d/%m/%Y'))) L 
INNER JOIN (select * from tabel_buku_besar A
                     where A.COMPANYAREA='$COMPANYAREA'
                           AND A.CURRENCY='$Currency'
                           AND A.ACCOUNT_CODE='$ACCOUNT'
                           AND (A.TANGGAL_INPUT BETWEEN STR_TO_DATE('$StartDate','%d/%m/%Y') AND STR_TO_DATE('$EndDate','%d/%m/%Y'))) R ON R.TRANSACTION_NUMBER=L.TRANSACTION_NUMBER AND R.COMPANYAREA=L.COMPANYAREA 
LEFT OUTER JOIN master_account C ON C.ACCOUNT_CODE=L.ACCOUNT_CODE AND C.COMPANYAREA=L.COMPANYAREA 
ORDER BY L.TANGGAL_INPUT,L.TRANSACTION_NUMBER`
Hilarius L. Doren
  • 757
  • 1
  • 12
  • 29
0

I find this to be the most efficient for finding if a value exists, logic can easily be inverted to find if a value doesn't exist (ie IS NULL);

SELECT * FROM primary_table st1
LEFT JOIN comparision_table st2 ON (st1.relevant_field = st2.relevant_field)
WHERE st2.primaryKey IS NOT NULL

*Replace relevant_field with the name of the value that you want to check exists in your table

*Replace primaryKey with the name of the primary key column on the comparison table.

Matt
  • 106
  • 5
0

It's slow because your sub-query is executed once for every comparison between relevant_field and your IN clause's sub-query. You can avoid that like so:

SELECT *
FROM some_table T1 INNER JOIN 
(
    SELECT relevant_field
    FROM some_table
    GROUP BY relevant_field
    HAVING COUNT(*) > 1
) T2 
USING(relevant_field)

This creates a derived table (in memory unless it's too large to fit) as T2, then INNER JOIN's it with T1. The JOIN happens one time, so the query is executed one time.

I find this particularly handy for optimising cases where a pivot is used to associate a bulk data table with a more specific data table and you want to produce counts of the bulk table based on a subset of the more specific one's related rows. If you can narrow down the bulk rows to <5% then the resulting sparse accesses will generally be faster than a full table scan.

ie you have a Users table (condition), an Orders table (pivot) and LineItems table (bulk) which references counts of Products. You want the sum of Products grouped by User in PostCode '90210'. In this case the JOIN will be orders of magnitude smaller than when using WHERE relevant_field IN( SELECT * FROM (...) T2 ), and therefore much faster, especially if that JOIN is spilling to disk!

Wil
  • 757
  • 9
  • 12