Delete duplicate rows but maintain many to many relationship

Question

I'm saving the results from my Tensorflow image classifier in a SQL database. I have 3 tables. Images, Categories and a table joining the two with weight variables. Some images have no relationships and some have lots.

The problem is that I have duplicate rows in the image table that need to be removed. But I need to preserve the many to many relationships, if the duplicated image has one or more.

Here's an example:

Table name: my_images

+----+------------+-----------------+
| ID | image_path | image_filename  |
+----+------------+-----------------+
|  1 | Film 1     | Film 1 001.jpg  |
|  2 | Film 1     | Film 1 001.jpg  |
|  3 | Film 1     | Film 1 002.jpg  |
|  4 | Film 1     | Film 1 002.jpg  |
|  5 | Film 1     | Film 1 003.jpg  |
|  6 | Film 1     | Film 1 003.jpg  |
+----+------------+-----------------+

Table name: my_terms

+---------+------------+
| term_id | term_name  |
+---------+------------+
|       1 | cat        |
|       2 | dog        |
|       3 | automobile |
+---------+------------+

Table name: my_term_relationships

+----------+---------+---------+
| image_id | term_id | weight  |
+----------+---------+---------+
|        2 |       1 | 0.58516 |
|        2 |       3 | 0.16721 |
|        3 |       2 | 0.21475 |
+----------+---------+---------+

So in this example, the ideal outcome would be to delete rows 1, 4 and either row 5 or 6 from my_images.

I wouldn't post an answer as it's been a long time since I've done real sql queries. I would start by creating a query that delete dups, like the 2nd most populare answer here : https://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql Then i'd add in to the subquery that the selected id has to exist in your my_term_relationships. Hope it helps — Logar, Aug 31 '17 at 08:17
By the way, is it possible that the same image_filename is referenced under different ids in `my_term_relationships` ? If yes, then my above proposition will not work. In this case I'd advice to first 'clean up' your `my_terms_relationships` table in order to only have one image_id per image_filename in this table. Then my above comment will be relevant I think — Logar, Aug 31 '17 at 08:22

fancyPants · Accepted Answer · 2017-08-31T11:25:18.367

Approach this step by step.

First, find the duplicate entries:

SELECT 
image_path, image_filename
FROM my_images
GROUP BY image_path, image_filename
HAVING COUNT(*) > 1

Second, get all rows that are duplicate:

SELECT mi.*
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename

Finally, get the IDs not to delete.

SELECT MIN(ID)
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0

see it working live in an sqlfiddle

Check if everything is correct. If yes, convert it into a delete statement.

DELETE my_images.* FROM my_images 
JOIN (
SELECT MIN(ID) AS ID
FROM my_images mi
JOIN (
    SELECT 
    image_path, image_filename
    FROM my_images
    GROUP BY image_path, image_filename
    HAVING COUNT(*) > 1
) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
LEFT JOIN my_term_relationships mtr ON mi.ID = mtr.image_id
WHERE mtr.image_id IS NULL 
GROUP BY mi.image_path, mi.image_filename
HAVING COUNT(*) > 0
) sq USING(ID);

sqlfiddle with delete statement: http://sqlfiddle.com/#!9/ed63de/1/0

EDIT: to also fix the problem that Logar mentioned, use this UPDATE statement before the DELETE statement.

UPDATE my_term_relationships mtr 
JOIN (
    SELECT mi.ID, minID
    FROM my_images mi
    JOIN (
        SELECT 
        image_path, image_filename, MIN(ID) AS minID
        FROM my_images
        GROUP BY image_path, image_filename
        HAVING COUNT(*) > 1
    ) dups ON mi.image_path = dups.image_path AND mi.image_filename = dups.image_filename
) sq ON mtr.image_id = sq.ID 
SET mtr.image_id = sq.minID;

see it all working together in this sqlfiddle: http://sqlfiddle.com/#!9/268539/1/0

Again, I believe if you have two ids referencing the same filename in `my_term_relationships` you will keep the dups in `my_images`, I'd add a first query that updates the ids in `my_term_relationhips` : To see what I mean, in your fiddle, change the values for VALUES (1, 1, 0.58516), (2, 3, 0.16721), (3, 2, 0.21475); — Logar, Aug 31 '17 at 08:43
Great answer, thanks. But when I try the delete query I get the error "You can't specify target table 'my_images' for update in FROM clause" — David Apple, Aug 31 '17 at 08:45
Thanks @fancyPants but I'm getting the error "Duplicate entry '87949-4' for key 'PRIMARY'" when I run your update query. — David Apple, Sep 01 '17 at 10:13
@fancyPants Perfect. I've dropped the private keys and it works beautifully. Thanks for all of your help. — David Apple, Sep 01 '17 at 18:33
@DavidApple You're welcome. Feel free to accept (and/or upvote) my answer. Would be appreciated and others know the problem is solved. — fancyPants, Sep 01 '17 at 21:10

Laposhasú Acsa · Answer 2 · 2017-08-31T09:30:42.117

You need to query two set of image ID's, and make a filter with them. Assuming image_path and image_filename is UNIQUE together:

All my_images ID, that is not referenced by my_term_relationships, but a corresponding image_path+image_filename might be referenced.
Unique ID's, which belongs to image_path+image_filename pairs, that is not referenced at all in my_term_relationships.

Take a look at this query:

DELETE FROM my_images 
WHERE 
   ID NOT IN (SELECT DISTINCT image_id FROM my_term_relationships) -- 1
   AND 
   ID NOT IN (SELECT id FROM (
    SELECT MIN(ID) as id
    FROM my_images 
    LEFT JOIN my_term_relationships ON ID = image_id
     GROUP BY image_path,image_filename
     HAVING COUNT(image_id) = 0
   ) as u_ids -- 2
);

Note, that you must wrap the my_images table in the DELETE's WHERE clause in a sub-query. Read this thread for explanation: Can't specify target table for update in FROM clause

Example: sqlfiddle

Example update query for removing duplicate rows from my_term_relationships:

UPDATE my_term_relationships
SET image_id = (
  select min(my_images.ID) 
  from my_images 
  join my_images as ref_image on (my_images.image_path = ref_image.image_path and my_images.image_filename = ref_image.image_filename)
  where ref_image.ID = image_id
 );

I still have a few duplicate image_path + image_filename pairs after running this query. Perhaps I have rows in my_term_relationships that point to duplicated images. Is there a way to merge these? — David Apple, Aug 31 '17 at 09:01
Then you need to run an UPDATE on my_term_relationships before you delete the rows. — Laposhasú Acsa, Aug 31 '17 at 09:15
Thanks, this is exactly what I'm looking for. But I'm getting the error "Column 'image_id' cannot be null" when I run the update query. — David Apple, Sep 01 '17 at 09:35

Delete duplicate rows but maintain many to many relationship

2 Answers2