Find and Delete Duplicate rows in MySQL

Question

I'm having trouble finding duplicates in a database table with the following setup:

==========================================================================
| stock_id  | product_id  | store_id  | stock_qty  | updated_at          |
==========================================================================
| 9990      | 51          | 1         | 13         | 2014-10-25 16:30:01 |
| 9991      | 90          | 2         | 5          | 2014-10-25 16:30:01 |
| 9992      | 161         | 1         | 3          | 2014-10-25 16:30:01 |
| 9993      | 254         | 1         | 18         | 2014-10-25 16:30:01 |
| 9994      | 284         | 2         | 12         | 2014-10-25 16:30:01 |
| 9995      | 51          | 1         | 11         | 2014-10-25 17:30:02 |
| 9996      | 90          | 2         | 5          | 2014-10-25 17:30:02 |
| 9997      | 161         | 1         | 3          | 2014-10-25 17:30:02 |
| 9998      | 254         | 1         | 16         | 2014-10-25 17:30:02 |
| 9999      | 284         | 2         | 12         | 2014-10-25 17:30:02 |
==========================================================================

Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:

SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at`
FROM `stock` s
INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id`
GROUP BY `stock_id`
HAVING COUNT(*) > 1
ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;

While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.

Any help or ideas are welcome. Thanks

hi @jbutler483, great question, the system we're working in has an ORM that requires the existence of a single primary key. We really on the ORM in other ways for scaffolding the CRUD management via the same system. — Herbage Onion, Nov 27 '14 at 16:04
Plus the database will keep the oldest data, which is not the required functionality :) — Artjoman, Nov 27 '14 at 16:05
maybe a command like [this describes?](http://stackoverflow.com/a/7436991/3436942) or [this](http://stackoverflow.com/questions/15552152/sybase-sql-remove-semi-duplicates-from-query-results) — jbutler483, Nov 27 '14 at 16:11
@HerbageOnion I can't see why it matters that there's a single PK - as long as you can have a unique key constraint on (product_id,store_id). Then `INSERT... ON DUPLICATE KEY UPDATE` will solve the problem — Strawberry, Nov 27 '14 at 17:06

score 2 · Answer 1 · answered Aug 03 '16 at 12:21

You can use this query:

DELETE st FROM stock st, stock st2 
WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND 
st.store_id = st2.store_id;

This query will delete older record having same product_id and store_id and will keep latest record.

Hartmut Holzgraefe · Accepted Answer · 2014-11-27T16:34:18.373

A self join on store_id, product_id and 'is older' in combination with DISTINCT should give you all rows where also a newer version exists:

> SHOW CREATE TABLE stock;
CREATE TABLE `stock` (
  `stock_id` int(11) NOT NULL,
  `product_id` int(11) DEFAULT NULL,
  `store_id` int(11) DEFAULT NULL,
  `stock_qty` int(11) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`stock_id`)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        2 |          2 |        2 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+


> SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at   
   FROM stock s1   JOIN stock s2     
     ON s1.store_id = s2.store_id    
    AND s1.product_id = s2.product_id    
    AND s1.updated_at < s2.updated_at;
+----------+----------+------------+---------------------+
| stock_id | store_id | product_id | updated_at          |
+----------+----------+------------+---------------------+
|        2 |        2 |          2 | 2001-01-01 12:00:00 |
+----------+----------+------------+---------------------+

> DELETE stock FROM stock 
               JOIN stock s2  ON stock.store_id = s2.store_id  
                             AND stock.product_id = s2.product_id 
                             AND stock.updated_at < s2.updated_at;
Query OK, 1 row affected (0.02 sec)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+

Which can be combined with `DELETE` by only selecting `s1.stock_id` and combining that query with `DELETE` as ` DELETE FROM stock WHERE stock_id IN (SELECT DISTINCT s1.stock_id FROM ...) ` — Hartmut Holzgraefe, Nov 27 '14 at 16:16
Thanks for your help Hartmut, did you mean to have `s1.store_id = s2.store_id` instead of `s1.stock_id = s2.stock_id`? I've tried that and I don't get any results... — Herbage Onion, Nov 27 '14 at 16:17
Just to be clear, there are no results but also no errors. Thanks for your help so far. — Herbage Onion, Nov 27 '14 at 16:19
I tried it again, I think the sq is another typo, so I used s1, still no results though — Herbage Onion, Nov 27 '14 at 16:21
i've rewritten the answer, now with test data and actual query results, also added a DELETE example ... does this work for you now? — Hartmut Holzgraefe, Nov 27 '14 at 16:35
It does work, I'm just going to make sure its accurate on the data in my system and I'll come back and mark the answer! — Herbage Onion, Nov 27 '14 at 16:54
Hi, so the first query I had was returning 4741 rows, your query is returning 3442, which has me a little worried. — Herbage Onion, Nov 27 '14 at 17:19
Your query also returned the latest entry while mine only filters out the older duplicates: ` — Hartmut Holzgraefe, Nov 27 '14 at 17:28

Artjoman · Answer 3 · 2014-11-27T17:03:04.990

Or you can use a stored Procedure:

DELIMITER //
DROP PROCEDURE IF EXISTS removeDuplicates;


 CREATE PROCEDURE removeDuplicates(
   stockID INT
 )
 BEGIN


    DECLARE stockToKeep INT;
    DECLARE storeID INT;
    DECLARE productID INT;

 -- gets the store and product value
 SELECT DISTINCT store_id, product_id
 FROM stock
  WHERE stock_id = stockID  
  LIMIT 1
 INTO
  storeID, productID;

 SELECT stock_id
 FROM stock
  WHERE product_id = productID AND store_id = storeID  
  ORDER BY updated_at DESC
  LIMIT 1
 INTO
  stockToKeep;

    DELETE FROM stock 
    WHERE product_id = productID AND store_id = storeID 
    AND stock_id != stockToKeep;
END //
DELIMITER ;

And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

  OPEN cur;

  testLoop: LOOP
    FETCH cur INTO stockID;
    IF done THEN
      LEAVE testLoop;
    END IF;
    CALL removeDuplicates(stockID);
  END LOOP testLoop;

  CLOSE cur;
END//
DELIMITER ;

And then just call the second procedure

CALL updateTable();

Hi thanks for the detailed answer, I have to be honest, this scares the crap out of me, I don't doubt you know your stuff and that's probably one hell of a solution but I need to be able to maintain the solution I implement and I don't think I could with that :( Thanks very much for your time, I'll be deconstructing your solution to learn more about stored procedures!! :) — Herbage Onion, Nov 27 '14 at 17:14

Find and Delete Duplicate rows in MySQL

3 Answers3