-1

Below is the structure of my table diary, which has some entries about trips. The problem is there are around 4,000 duplicate entries which I found after hashing the notes column into hash column. Total entries are around 47,000.

Table structure

I want to delete all the duplicate rows leaving just 1 row based on the hash column

I can find the duplicate rows using

SELECT `url`, COUNT(hash) c from diary GROUP BY `hash` HAVING c > 1 ORDER BY `c` DESC; 

and then use this SQL to find the url id of the duplicate rows

SELECT `url`, places, days,`hash` FROM  diary WHERE `hash` = (select `hash` from diary WHERE `url` = ?); 

Desired output

For this table I want to have any 1 row from the first 2 rows as they have same hash and 3rd, 4th row. Also to be noted in the diary table there may be duplicate hash with count as high as 400.

Ans: Add hash as unique key then INSERT IGNORE INTO diary2 (c1, c2, c3) SELECT cl, c2 FROM diary;

Sourav
  • 17,065
  • 35
  • 101
  • 159

2 Answers2

0

I am not sure about what you mean by "duplicates" exactly, but to answer the last part of your question:

For this table I want to have any 1 row from the first 2 rows as they have same hash and 3rd, 4th row. Also to be noted in the diary table there may be duplicate hash with count as high as 400.

you can use the row_number function for this:

with cte as (
select `url`, places, days,`hash`, 
rownumber() over (partition by hash) rown 
from diary where `hash` = (select `hash` from diary WHERE `url` = ?)
)
select * from cte where rown = 1;

Partitioning by hash, means that the rown column will have 1 as value for the first line with hash=1 and 2 as value for the second line. And by specifying that you want to select rows with rown=1, you will get the first line for hash=1 (and the single line for hash=3 and hash=4.

romborimba
  • 233
  • 2
  • 10
0

Here is some example to delete a duplicate row.

Sample Table Data :

   url    days  places  itenerary    flag  remarks    hash  
------  ------  ------  ---------  ------  -------  --------
     1       1       1          1       0   (NULL)         1
     2       2       2          2       0   (NULL)         1
     3       3       3          3       0   (NULL)         3
     4       4       4          4       0   (NULL)         4
     5       5       5          5       0   (NULL)         4

DELETE QUERY


Delete duplicate using this query according to duplicate hash:

DELETE t1 FROM diary t1 INNER JOIN diary t2
WHERE t1.`url`>t2.`url` AND `t1`.`hash`=`t2`.`hash`;

Result :

1 queries executed, 1 success, 0 errors, 0 warnings
Query: delete t1 from diary t1 inner join diary t2 where t1.`url`>t2.`url` and `t1`.`hash`=`t2`.`hash`
2 row(s) affected

   url    days  places  itenerary    flag  remarks    hash  
------  ------  ------  ---------  ------  -------  --------
     1       1       1          1       0   (NULL)         1
     3       3       3          3       0   (NULL)         3
     4       4       4          4       0   (NULL)         4

This will delete any duplicates and leave 1 copy with the lowest url number per duplicate.

Link: How To Delete Duplicate Rows in MySQL

Kuro Neko
  • 795
  • 12
  • 19