2

I'm having trouble finding duplicates in a database table with the following setup:

==========================================================================
| stock_id  | product_id  | store_id  | stock_qty  | updated_at          |
==========================================================================
| 9990      | 51          | 1         | 13         | 2014-10-25 16:30:01 |
| 9991      | 90          | 2         | 5          | 2014-10-25 16:30:01 |
| 9992      | 161         | 1         | 3          | 2014-10-25 16:30:01 |
| 9993      | 254         | 1         | 18         | 2014-10-25 16:30:01 |
| 9994      | 284         | 2         | 12         | 2014-10-25 16:30:01 |
| 9995      | 51          | 1         | 11         | 2014-10-25 17:30:02 |
| 9996      | 90          | 2         | 5          | 2014-10-25 17:30:02 |
| 9997      | 161         | 1         | 3          | 2014-10-25 17:30:02 |
| 9998      | 254         | 1         | 16         | 2014-10-25 17:30:02 |
| 9999      | 284         | 2         | 12         | 2014-10-25 17:30:02 |
==========================================================================

Stock updates are imported into this table every hour, I'm trying to find duplicate stock entries (any rows which have a matching product id and store id) so I can delete the oldest. The query below is my attempt, by comparing product ids and store ids on a join like this I can find one set of duplicates:

SELECT s.`stock_id`, s.`product_id`, s.`store_id`, s.`stock_qty`, s.`updated_at`
FROM `stock` s
INNER JOIN `stock` j ON s.`product_id`=j.`product_id` AND s.`store_id`=j.`store_id`
GROUP BY `stock_id`
HAVING COUNT(*) > 1
ORDER BY s.updated_at DESC, s.product_id ASC, s.store_id ASC, s.stock_id ASC;

While this query will work, it doesn't find ALL duplicates, only 1 set, which means if an import goes awry and isn't noticed until the morning, there's a possibility that we'll be left with tons of duplicate stock entries. My MySQL skills are sadly lacking and I'm at a complete loss about how to find and delete all duplicates in a fast, reliable manner.

Any help or ideas are welcome. Thanks

Herbage Onion
  • 181
  • 4
  • 15
  • 1
    why aren't you setting them as a composite PK? – jbutler483 Nov 27 '14 at 15:56
  • hi @jbutler483, great question, the system we're working in has an ORM that requires the existence of a single primary key. We really on the ORM in other ways for scaffolding the CRUD management via the same system. – Herbage Onion Nov 27 '14 at 16:04
  • Plus the database will keep the oldest data, which is not the required functionality :) – Artjoman Nov 27 '14 at 16:05
  • maybe a command like [this describes?](http://stackoverflow.com/a/7436991/3436942) or [this](http://stackoverflow.com/questions/15552152/sybase-sql-remove-semi-duplicates-from-query-results) – jbutler483 Nov 27 '14 at 16:11
  • @HerbageOnion I can't see why it matters that there's a single PK - as long as you can have a unique key constraint on (product_id,store_id). Then `INSERT... ON DUPLICATE KEY UPDATE` will solve the problem – Strawberry Nov 27 '14 at 17:06

3 Answers3

2

You can use this query:

DELETE st FROM stock st, stock st2 
WHERE st.stock_id < st2.stock_id AND st.product_id = st2.product_id AND 
st.store_id = st2.store_id;

This query will delete older record having same product_id and store_id and will keep latest record.

Kruti Patel
  • 1,422
  • 2
  • 23
  • 36
1

A self join on store_id, product_id and 'is older' in combination with DISTINCT should give you all rows where also a newer version exists:

> SHOW CREATE TABLE stock;
CREATE TABLE `stock` (
  `stock_id` int(11) NOT NULL,
  `product_id` int(11) DEFAULT NULL,
  `store_id` int(11) DEFAULT NULL,
  `stock_qty` int(11) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  PRIMARY KEY (`stock_id`)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        2 |          2 |        2 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+


> SELECT DISTINCT s1.stock_id, s1.store_id, s1.product_id, s1.updated_at   
   FROM stock s1   JOIN stock s2     
     ON s1.store_id = s2.store_id    
    AND s1.product_id = s2.product_id    
    AND s1.updated_at < s2.updated_at;
+----------+----------+------------+---------------------+
| stock_id | store_id | product_id | updated_at          |
+----------+----------+------------+---------------------+
|        2 |        2 |          2 | 2001-01-01 12:00:00 |
+----------+----------+------------+---------------------+

> DELETE stock FROM stock 
               JOIN stock s2  ON stock.store_id = s2.store_id  
                             AND stock.product_id = s2.product_id 
                             AND stock.updated_at < s2.updated_at;
Query OK, 1 row affected (0.02 sec)

> select * from stock;
+----------+------------+----------+-----------+---------------------+
| stock_id | product_id | store_id | stock_qty | updated_at          |
+----------+------------+----------+-----------+---------------------+
|        1 |          1 |        1 |         1 | 2001-01-01 12:00:00 |
|        3 |          2 |        2 |         1 | 2002-01-01 12:00:00 |
+----------+------------+----------+-----------+---------------------+
Hartmut Holzgraefe
  • 2,585
  • 12
  • 14
1

Or you can use a stored Procedure:

DELIMITER //
DROP PROCEDURE IF EXISTS removeDuplicates;


 CREATE PROCEDURE removeDuplicates(
   stockID INT
 )
 BEGIN


    DECLARE stockToKeep INT;
    DECLARE storeID INT;
    DECLARE productID INT;

 -- gets the store and product value
 SELECT DISTINCT store_id, product_id
 FROM stock
  WHERE stock_id = stockID  
  LIMIT 1
 INTO
  storeID, productID;

 SELECT stock_id
 FROM stock
  WHERE product_id = productID AND store_id = storeID  
  ORDER BY updated_at DESC
  LIMIT 1
 INTO
  stockToKeep;

    DELETE FROM stock 
    WHERE product_id = productID AND store_id = storeID 
    AND stock_id != stockToKeep;
END //
DELIMITER ;

And afterwards call it for every pair of the product id and store id via a cursor procedure: DELIMITER // CREATE PROCEDURE updateTable() BEGIN DECLARE done BOOLEAN DEFAULT FALSE; DECLARE stockID INT UNSIGNED; DECLARE cur CURSOR FOR SELECT DISTINCT stock_id FROM stock; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done := TRUE;

  OPEN cur;

  testLoop: LOOP
    FETCH cur INTO stockID;
    IF done THEN
      LEAVE testLoop;
    END IF;
    CALL removeDuplicates(stockID);
  END LOOP testLoop;

  CLOSE cur;
END//
DELIMITER ;

And then just call the second procedure

CALL updateTable();
Artjoman
  • 286
  • 1
  • 9
  • 1
    Hi thanks for the detailed answer, I have to be honest, this scares the crap out of me, I don't doubt you know your stuff and that's probably one hell of a solution but I need to be able to maintain the solution I implement and I don't think I could with that :( Thanks very much for your time, I'll be deconstructing your solution to learn more about stored procedures!! :) – Herbage Onion Nov 27 '14 at 17:14