Given MySQL DB with duplicate rows based on criteria, list all the OLDER rows, for external processing (php)

Question

Situation:

Old scripts added rows to a table without deleting existing rows.
Need to discover "duplicate" rows (based on matching two fields).
For each set of duplicate rows, sort by ids and return all but the newest one (highest id).
Each row has an associated external file, so can't simply delete the older rows - need to return a list of all the older rows, which will then be processed by a php script.

Example:

TABLE mytable:

ID  A   B      Filename
 1  10  abc    aa.png
 2  11  dddd   bb.xml
 3  10  abc    cc.png
 4  10  dddd   dd.png
 5  10  abc    ee.xml
 6  11  dddd   ff.xml

Rows with IDs 1 & 3 & 5 are duplicates (both A and B match). Similarly, 2 & 6 are duplicates. Return list (1, 2, 3) - these are the "older" rows that need to be processed.

Even better: return a set of records, containing 'ID' and 'Filename' for those rows.

My primary question is an SQL query that does this, though it would also be useful to me to see how to use the result of that query in php.

There are existing stackoverflow posts related to deleting duplicate rows, but the ones I found delete the rows directly. This won't work for me, as I need to have the external php script delete the corresponding external files:

Deleting Duplicate Rows from MySql Table

How to delete duplicate records in mysql database?

How to delete all the duplicate records in PHP/Mysql

IMPORTANT: The other posts which I quote don't bother to distinguish newer from older; they are about removing fully duplicate records, but that is not my situation. I have records which are partially duplicates; that is, several records match the specified criteria, but there is important information in other fields, hence I have to know which is newest (highest id) for each value of criteria; those are the ones to keep.

Use the deletion query select part and write it into a temporary table. You can then iterate over that table. — Norbert, Jul 12 '15 at 23:44
@NorbertvanNobelen - No, that doesn't explain to me how to make a selection that includes only the ***older*** duplicates. — ToolmakerSteve, Jul 12 '15 at 23:46

Jaylen · Accepted Answer · 2015-07-13T00:17:11.730

I would try this "make sure you test the code before to apply it on production data"

Assuming you have lots of data, I would create temporary table of the data that you want to keep so you can perform the operation fast.

    -- Generate a list of the IDs to keep

    CREATE TEMPORARY TABLE keepers (KEY(ID)) ENGINE = MEMORY
    SELECT A, B MIN(ID) AS ID
    FROM table
    GROUP BY A, B;

    -- Delete the records that you do not wish to keep
    DELETE FROM table
    WHERE NOT EXISTS (SELECT 1 FROM keepers WHERE ID = table.ID);

If the DELETE query does not work "return an error" about the sub query, you can try this instead of the DELETE query.

      CREATE TEMPORARY TABLE deleteme (KEY(ID)) ENGINE = MEMORY
      SELECT ID FROM table
      WHERE NOT EXISTS (SELECT 1 FROM keepers WHERE ID = table.ID);

DELETE t.* FROM table AS t
INNER JOIN deleteme AS d ON d.ID = t.ID;

Norbert · Answer 2 · 2015-07-13T00:50:02.953

0

To get the data: Select the records you want to keep (inner query) and join back on itself (outer query) keeping all records and using the dummyfield to find the to be deleted records.

CREATE TEMPORARY TABLE delete_these AS
SELECT *
FROM table a 
LEFT JOIN (SELECT MAX(id) as non_deletion_id, 1 AS dummyfield, 
  FROM table a
  GROUP BY your two fields) b ON non_deletion_id=a.id
WHERE dummyfield IS NULL;

edited Jul 13 '15 at 00:50

answered Jul 12 '15 at 23:54

Norbert

6,026
3
17
40

Thx for your quick response. What is the benefit to creating a temporary table, rather than simply performing a query that returns the information (which you are putting into the temporary table)? – ToolmakerSteve Jul 12 '15 at 23:59
Habit. I first test my results in a mysql session and instead of getting loads of data back to the display, I can check the data more selective (prepared test cases to check for mistakes). P.S. What data do you need to check in php? Depending on the checks, they could be placed in the query. – Norbert Jul 13 '15 at 00:03
OK, that makes sense - safer that way. The php just needs to extract filename field per id, as it has access to the storage area containing those external files. – ToolmakerSteve Jul 13 '15 at 00:10
syntax error - `missing closing parenthesis` at `dummyfield`. I've been trying variations on your syntax, without success. Reading up on the constructs you are using, to see if I can figure out what you intended. Perhaps an explanatory sentence or two? – ToolmakerSteve Jul 13 '15 at 00:21
In particular, `AS dummyfield as non_deletion_id` - not seeing how there can be two `AS` in a row. Also not sure whether your *lowercase* `table` is the SQL keyword `TABLE` or is `mytable`. – ToolmakerSteve Jul 13 '15 at 00:22
Typo: See the correction: I added dummyfield as last one, not paying attention to the alias for the non_deletion_id – Norbert Jul 13 '15 at 00:49

Given MySQL DB with duplicate rows based on criteria, list all the OLDER rows, for external processing (php)

2 Answers2