
I have a table with many duplicate records:

shop
ID     tax_id
1      10
1      10
1      11
2      10
2      12
2      10
2      10

I want to delete all duplicate records without creating a temporary table. After the update query, the table should look like:

shop
ID     tax_id
1      10
1      11
2      10
2      12
Paul Annesley
Lina

5 Answers


Here's an in-place solution (though not a one-liner).

Find out max id:

select max(id) as maxid 
  from shop;

Remember this value. Let's say it equals 1000.

Re-insert unique values, with offset:

insert into shop (id, tax_id) 
select distinct id + 1000, tax_id 
  from shop;

Drop old values:

delete from shop
  where id <= 1000;

Restore normal ids:

update shop
  set id = id - 1000;
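For illustration, the four steps can be replayed end-to-end on the sample data. This sketch uses Python's sqlite3 purely as a stand-in for MySQL (syntax of the individual statements is the same here):

```python
import sqlite3

# In-memory table mirroring the question's sample data (no constraints).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shop (id INTEGER, tax_id INTEGER)")
con.executemany("INSERT INTO shop VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 10), (2, 10)])

# Step 1: find the max id to use as an offset.
maxid = con.execute("SELECT MAX(id) FROM shop").fetchone()[0]

# Step 2: re-insert the distinct pairs, shifted above the existing id range.
con.execute("INSERT INTO shop SELECT DISTINCT id + ?, tax_id FROM shop", (maxid,))

# Step 3: drop the old (duplicated) rows.
con.execute("DELETE FROM shop WHERE id <= ?", (maxid,))

# Step 4: shift the ids back down.
con.execute("UPDATE shop SET id = id - ?", (maxid,))

print(sorted(con.execute("SELECT id, tax_id FROM shop")))
# [(1, 10), (1, 11), (2, 10), (2, 12)]
```

The offset guarantees the re-inserted rows never collide with the originals, which is what makes the blanket `DELETE ... WHERE id <= maxid` safe.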

PROFIT!

Sergio Tulentsev
  • So far this solution is the only one that matches all criteria. You should upvote it instead. :-) – Sergio Tulentsev Dec 21 '11 at 13:14
  • Duplicating the entries isn't exactly good practice or practical; imagine implementing this on a table with more than a small number of records. The INSERT statement is indiscriminate of tax_id and therefore will only store two (possibly wrong, depending on whether the table is sorted or not) entries from the example. – CBusBus Dec 21 '11 at 13:32
  • This is not the kind of operation you run 500 times/sec. This is a data-corruption fixup. Even if it takes several seconds, it should be acceptable (given the circumstances). I don't know what indiscrimination you're talking about, but this solution gives precisely the answer the OP is looking for. – Sergio Tulentsev Dec 21 '11 at 13:34
  • This might as well be a temporary table. – Lightness Races in Orbit Dec 21 '11 at 13:35
  • @TomalakGeret'kal Yet it is not. :-) Maybe user doesn't have permissions to create tables. – Sergio Tulentsev Dec 21 '11 at 13:38
  • Admittedly it's not clear, but it seems highly likely that the OP wishes to remove duplicate _in-place_. – Lightness Races in Orbit Dec 21 '11 at 13:51
  • That's basically what my code is doing. Reliably and within requirements. You're welcome to post your own solution. :-) – Sergio Tulentsev Dec 21 '11 at 13:53
  • As stated before, this is terrible practice and would likely be grounds for disciplinary action, if not redundancy; but besides that, you're making assumptions as to what the OP intends to use this for. It might run every week, day, hour, minute... That said, I'm making the assumption that the code is for commercial use. – CBusBus Dec 21 '11 at 14:06
  • This looks more like an interview question to me, especially the part with "no temp table". In a commercial environment there would be a compound unique index to prevent the problem from ever happening. – Sergio Tulentsev Dec 21 '11 at 14:09
  • "looks" isn't necessarily "is"; besides, would you give someone a job if they provided this as a solution? Also, you'd expect a compound unique index, but that doesn't mean there will be one; see almost every DB-related article on thedailywtf.com – CBusBus Dec 21 '11 at 14:22
  • I consider this solution to be perfectly acceptable, given the requirements as I understand them. Yes, I would hire that guy, or, rather, I would not not hire him. – Sergio Tulentsev Dec 21 '11 at 14:27
  • I consider the solution rather fragile, since it breaks the consistency of the rest of the table. – newtover Dec 21 '11 at 23:21
  • @newtover could you elaborate on that? – Sergio Tulentsev Dec 22 '11 at 09:55
  • I just dislike solutions that imply id modifications. – newtover Dec 22 '11 at 10:23
  • Me too, I would do it via temp table. But this option was not available. – Sergio Tulentsev Dec 22 '11 at 11:13

Working solution.

-- SQL query to find duplicates
SELECT id, tax_id, COUNT(*) AS cnt 
  FROM shop 
  GROUP BY id, tax_id
  HAVING cnt > 1;

--- res

+------+--------+-----+
| id   | tax_id | cnt |
+------+--------+-----+
|    1 |     10 |   2 |
|    2 |     10 |   3 |
+------+--------+-----+


-- Iterate through the results with your language of choice
DELETE 
  FROM shop 
  WHERE id = <res id> 
    AND tax_id = <res tax_id> 
  LIMIT <cnt - 1>;

---res (iterated)

+------+--------+
| id   | tax_id |
+------+--------+
|    1 |     10 |
|    1 |     11 |
|    2 |     12 |
|    2 |     10 |
+------+--------+

The two queries require a small piece of PHP to carry out the deletes:

$res = mysql_query("SELECT id, tax_id, COUNT(*) AS cnt 
                      FROM shop 
                      GROUP BY id, tax_id
                      HAVING cnt > 1");
while ($row = mysql_fetch_assoc($res)) {
    mysql_query("DELETE 
                   FROM shop 
                   WHERE id = " . $row['id'] . "
                     AND tax_id = " . $row['tax_id'] . "
                   LIMIT " . ($row['cnt'] - 1));
}
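The same count-then-trim idea can be sketched in Python with sqlite3 (illustration only, not the answer's MySQL/PHP; stock sqlite3 builds lack `DELETE ... LIMIT`, so this variant removes the surplus rows via their rowids instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shop (id INTEGER, tax_id INTEGER)")
con.executemany("INSERT INTO shop VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 10), (2, 10)])

# Find each duplicated (id, tax_id) pair and how many copies it has.
dupes = con.execute("""SELECT id, tax_id, COUNT(*) AS cnt
                         FROM shop GROUP BY id, tax_id HAVING cnt > 1""").fetchall()

# For each pair, delete all but one copy (cnt - 1 surplus rows).
for id_, tax_id, cnt in dupes:
    con.execute("""DELETE FROM shop WHERE rowid IN
                     (SELECT rowid FROM shop WHERE id = ? AND tax_id = ? LIMIT ?)""",
                (id_, tax_id, cnt - 1))

print(sorted(con.execute("SELECT id, tax_id FROM shop")))
# [(1, 10), (1, 11), (2, 10), (2, 12)]
```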

Edit: revisited this recently; for what it's worth, here's an alternative solution using a temporary column, removing the need for a scripting language.

ALTER TABLE shop ADD COLUMN place INT;

SET @i = 0;

UPDATE shop SET place = @i:= @i + 1;

DELETE FROM shop WHERE place NOT IN (SELECT keep FROM (SELECT MIN(place) AS keep FROM shop GROUP BY id, tax_id) AS t);

ALTER TABLE shop DROP COLUMN place;
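The numbering-column trick is easy to try out elsewhere. In this sqlite3 sketch (illustration only, not the answer's MySQL), the implicit rowid already plays the role of the added `place` column, so no ALTER TABLE is needed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shop (id INTEGER, tax_id INTEGER)")
con.executemany("INSERT INTO shop VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 10), (2, 10)])

# Keep the lowest rowid of each (id, tax_id) group, delete the rest.
# (sqlite allows the same-table subquery that MySQL would reject.)
con.execute("""DELETE FROM shop WHERE rowid NOT IN
                 (SELECT MIN(rowid) FROM shop GROUP BY id, tax_id)""")

print(sorted(con.execute("SELECT id, tax_id FROM shop")))
# [(1, 10), (1, 11), (2, 10), (2, 12)]
```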
CBusBus

Maybe this can help:

$query = "SELECT * FROM shop ORDER BY id";
$rez = $dbh->query($query);
$multi = $rez->fetchAll(PDO::FETCH_ASSOC);
$seen = array();
foreach ($multi as $key => $row) {
    $pair = $row['id'] . '.' . $row['tax_id'];
    if (isset($seen[$pair])) continue;   // this pair has already been trimmed
    $seen[$pair] = true;
    foreach (array_slice($multi, $key + 1) as $rest) {
        if ($row['id'] == $rest['id'] && $row['tax_id'] == $rest['tax_id']) {
            // LIMIT 1 keeps the first occurrence; without it every copy would be deleted
            $dbh->query("DELETE FROM shop WHERE id={$rest['id']} AND tax_id={$rest['tax_id']} LIMIT 1");
        }
    }
}

The first foreach iterates over every row, and the second one does the comparison. I'm using PDO but, of course, you can do it the procedural way.
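The fetch-everything-then-compare-pairs pattern translates directly. Here's a sketch in Python with sqlite3 (illustration only, not the answer's PDO code), deleting the later copy of each matching pair by rowid so the first occurrence survives:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shop (id INTEGER, tax_id INTEGER)")
con.executemany("INSERT INTO shop VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 10), (2, 10)])

# Fetch everything once, then compare each row against the rows after it.
rows = con.execute("SELECT rowid, id, tax_id FROM shop ORDER BY id").fetchall()
for i, (rid, id_, tax_id) in enumerate(rows):
    for later_rid, later_id, later_tax in rows[i + 1:]:
        if (id_, tax_id) == (later_id, later_tax):
            # Delete the later copy; re-deleting the same rowid is a no-op.
            con.execute("DELETE FROM shop WHERE rowid = ?", (later_rid,))

print(sorted(con.execute("SELECT id, tax_id FROM shop")))
# [(1, 10), (1, 11), (2, 10), (2, 12)]
```

Like the PHP version, this is O(n²) in the number of rows, so it only suits small tables.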

Linger
neso-72

First off, you can prevent this by creating a Unique Index on those two fields, for future reference.

As for the solution, create a new table shopnew with the same structure in MySQL, or just delete every record from the original table once the record list has been generated (make sure you have a backup!):

//Get every record from mysql
$sSQL = "SELECT ID, tax_id FROM shop";
$oRes = mysql_query($sSQL);
$aRecordList = array();
while ($aRow = mysql_fetch_assoc($oRes)) {
    //If a record is a duplicate, it simply 'overwrites' the same key
    $aRecordList[$aRow['ID'] . "." . $aRow['tax_id']] = 1;
}
//You could DELETE FROM shop here instead, if you don't want an additional table
//$aRecordList now only contains unique records
foreach ($aRecordList as $sRecord => $bSet) {
    $aExpRecord = explode(".", $sRecord);
    mysql_query("INSERT INTO shopnew SET id = " . $aExpRecord[0] . ", tax_id = " . $aExpRecord[1]);
}
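The dedupe-in-memory-then-reinsert idea is the same in any language; this Python/sqlite3 sketch (illustration only, not the answer's PHP) uses a dict key in place of the concatenated-string array key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE shop (id INTEGER, tax_id INTEGER)")
con.execute("CREATE TABLE shopnew (id INTEGER, tax_id INTEGER)")
con.executemany("INSERT INTO shop VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 11), (2, 10), (2, 12), (2, 10), (2, 10)])

# Collect every (id, tax_id) pair; using the pair as a dict key
# collapses the duplicates, just like the PHP array key does.
records = {}
for id_, tax_id in con.execute("SELECT id, tax_id FROM shop"):
    records[(id_, tax_id)] = 1

# Re-insert only the unique pairs into the new table.
con.executemany("INSERT INTO shopnew VALUES (?, ?)", list(records))

print(sorted(con.execute("SELECT id, tax_id FROM shopnew")))
# [(1, 10), (1, 11), (2, 10), (2, 12)]
```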
Lightness Races in Orbit
Derk Arts
  • In the example, neither ID nor tax_id are unique. Each shop has multiple tax_ids associated with it. – Grexis Dec 21 '11 at 13:00
  • Yes, but you can create a Unique index on both! That is what you are looking for, so the two fields combined cannot already be present. Check: http://www.mysqlfaqs.net/mysql-faqs/Indexes/Unique-Key-or-Index/How-to-create-multi-column-unique-key-or-index-in-MySQL – Derk Arts Dec 21 '11 at 13:03
  • Yeah, so? His solution maintains uniqueness of pair (id, tax_id) (at least, as far as I can tell). Though it violates the condition of not creating new table. – Sergio Tulentsev Dec 21 '11 at 13:03
  • Ah, I read your solution wrong. Now it makes sense :) Line 7 is missing a closing bracket. – Grexis Dec 21 '11 at 13:06
  • Right, it's a bit harder to code inside the SO textarea; I should be doing that in my IDE :) – Derk Arts Dec 21 '11 at 13:08
  • The OP asked for no temporary tables; moving the temporary table into PHP variable memory is probably not what he wants. – Lightness Races in Orbit Dec 21 '11 at 13:34
  • Also @saratis creates a new permanent table (which is good in everyday life, but goes against requirements here) – Sergio Tulentsev Dec 21 '11 at 13:39
  • As I included in the answer it is perfectly suitable to use with one table as long as you `DELETE FROM shop` before doing the new insert. – Derk Arts Dec 21 '11 at 14:36

In reality, the question with its current limitations is quite a tricky challenge. I thought about the solution the whole evening (understanding that it will never be of use). I would not use it in the wild; I just tried to find out whether it is possible using MySQL only.

The question in my formulation: is it possible to write a series of DELETE statements that will remove duplicate rows from a two-column table without unique constraints?

Problems:

  1. rows do not have an identity or primary key, so one has to think up a way to refer to the single row of each group that should stay
  2. we need to group rows somehow, that is, apply an order and then a condition; but the form of DELETE that supports ORDER BY can only have a WHERE clause and does not support HAVING, so the order is applied after the condition is evaluated
  3. we would not need to sort the rows at all if they were arranged by a clustered primary key, but we do not have one

Suppose we have a table:

CREATE TABLE  `tablename` (
  `a_id` int(10) unsigned NOT NULL,
  `b_id` int(10) unsigned NOT NULL,
  KEY `Index_1` (`a_id`,`b_id`)
) ENGINE=InnoDB COLLATE utf8_bin;

I added a key (not UNIQUE or PRIMARY) to make lookups faster and in the hope of using it in groupings.

You can feed the table with some values:

INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);

As a side effect, the key became a covering index: when we SELECT from the table, the values come out sorted, but when we delete, the rows are read in the order we inserted them.

Now, let's look at the following query:

SELECT @c, @a_id as a, @b_id as b, a_id, b_id
FROM tablename, (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) >= 1
;

And its result:

@c, a, b, a_id, b_id
 1, 1, 1,    1,    1
 2, 1, 1,    1,    1
 3, 1, 1,    1,    1
 1, 1, 4,    1,    4
 2, 1, 4,    1,    4
 3, 1, 4,    1,    4
 1, 2, 2,    2,    2
 2, 2, 2,    2,    2
 3, 2, 2,    2,    2
 1, 2, 3,    2,    3
 2, 2, 3,    2,    3
 3, 2, 3,    2,    3

The results are automatically sorted using Index_1, and duplicate (a_id, b_id) pairs are enumerated in column @c. Our task now is to remove all rows where @c > 1. The only remaining problem is forcing MySQL to use Index_1 for the deletion, which is rather tricky without applying additional conditions. But we can do this by using an equality check, or several of them, on a_id:

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (1)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (2)
  AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;

SELECT * FROM tablename t;

a_id, b_id
   1,    1
   1,    4
   2,    2
   2,    3

I cannot put all possible a_id values in IN(), because MySQL will then decide the index is useless and the query will not remove all duplicates (only adjacent ones); but with, say, 10 different a_id values I can remove the duplicates in two DELETE statements, each with 5 explicit ids in its IN clause.

Hope this might be useful to someone =)
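The running-counter trick at the heart of this answer is easy to see outside MySQL. This Python sketch (illustration only; the user variables are what make it expressible inside a single MySQL DELETE) replays the same @c enumeration over the index-sorted rows and drops every row whose counter exceeds 1:

```python
# Sample rows: the three identical INSERT batches from above.
rows = [(2, 3), (1, 1), (2, 2), (1, 4)] * 3

# The index reads the rows back sorted by (a_id, b_id).
ordered = sorted(rows)

# Replay the @a_id/@b_id/@c logic: @c counts consecutive identical pairs.
prev, c, kept = None, 0, []
for pair in ordered:
    c = c + 1 if pair == prev else 1   # @c := IF(same pair, @c + 1, 1)
    prev = pair
    if c == 1:                         # the DELETE removes rows where @c > 1
        kept.append(pair)

print(kept)
# [(1, 1), (1, 4), (2, 2), (2, 3)]
```

Because the counter resets whenever the pair changes, the logic only works on sorted input, which is exactly why the MySQL version must force the scan through Index_1.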

newtover