In reality the question with its current limitations is quite a tricky challenge. I thought about the solution the whole evening (understanding that the solution will never be of use). I would not use the solution in wild nature, I just tried to find out if it is possible to do using MySQL only.
The question in my formulation: is it possible to write a series of DELETE statements that will remove duplicate rows from a two-column table without unique constraints?
Problems:
- rows do not have an identity key or a primary key, so one should think up a way to refer to a single row that should stay
- we will need to group rows somehow, that is to apply an order and then condition, but the form of
DELETE
that supports ORDER BY
can only have a WHERE
clause and does not support HAVING
. That is the order is applied after a condition is met.
- we would not need to sort rows if the values would be arranged by the clustered primary key, but we do not have one.
Suppose we have a table:
CREATE TABLE `tablename` (
`a_id` int(10) unsigned NOT NULL,
`b_id` int(10) unsigned NOT NULL,
KEY `Index_1` (`a_id`,`b_id`)
) ENGINE=InnoDB COLLATE utf8_bin;
I added a key (not UNIQUE or PRIMARY) to make lookups faster and hoping to use it in groupings.
You can feed the table with some values:
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
As a side effect, the key became a coverage index and when we make SELECTs from the table the values displayed are sorted, but when we make deletions the values are read in the order we inserted them.
Now, let's look at the following query:
SELECT @c, @a_id as a, @b_id as b, a_id, b_id
FROM tablename, (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) >= 1
;
And its result:
@c, a, b, a_id, b_id
1, 1, 1, 1, 1
2, 1, 1, 1, 1
3, 1, 1, 1, 1
1, 1, 4, 1, 4
2, 1, 4, 1, 4
3, 1, 4, 1, 4
1, 2, 2, 2, 2
2, 2, 2, 2, 2
3, 2, 2, 2, 2
1, 2, 3, 2, 3
2, 2, 3, 2, 3
3, 2, 3, 2, 3
The results are automatically sorted using Index_1
, and duplicate pairs (a_id, b_id)
are enumerated in column @c
. That is our task now is to remove all rows where @c > 1
. The only problem we have is to force MySQL use Index_1
on deletion which is rather tricky without applying an additional conditions. But we can do this by using an equality check or multiple equality checks on a_id
:
DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (1)
AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;
DELETE FROM t
USING tablename t FORCE INDEX (Index_1)
JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init
WHERE a_id IN (2)
AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1;
SELECT * FROM tablename t;
a_id, b_id
1, 1
1, 4
2, 2
2, 3
I can not put all possible a_id
in IN()
because MySQL will understand that the index is useless in this case and the query will not remove all duplicates (only adjacent), but having say 10 different a_id
I can remove duplicates in two DELETE statements, each IN will have 5 explicit ids.
Hope, this might be useful to someone =)