2

I have a table `eng-jap` which is essentially just a translation table, so it has an English column and a Japanese column. A script I made somehow caused every insert to be cloned, leaving thousands of duplicate entries in this table, for example:

duplicate example A

eng                        jap
"mother washes every day"  "母は毎日洗濯する"
"mother washes every day"  "母は毎日洗濯する"

If it were just one column I could use the query:

SELECT eng, COUNT(*) c FROM `eng-jap` GROUP BY eng HAVING c > 1

But the table can legitimately have duplicates in eng or in jap, as long as the duplication isn't in both columns at once. For example:

duplicate example B

eng                        jap
"mother washes every day"  "母は毎日洗濯する"
"every day mother washes"  "母は毎日洗濯する"

This is to allow one sentence to have more than one translation, so I need to alter the query to find duplicates as a combination of both columns, I guess you could say.

Once again, to be clear: example B is fine. I want to select all duplicates like example A so I can make a script to remove one copy of each duplicate. Please and thank you!

user1397417
  • In your case, since there could be multiple translations for one English sentence, I think it is better to split this table into three tables `english: id, sentence`, `Jap: id, sentence`, `translation: id, englishid, japid` (see the schema sketch after these comments) – Mahmoud Gamal Nov 11 '12 at 07:57
  • Related but not exactly the same, matching on multiple columns http://stackoverflow.com/questions/16324328/mysql-select-records-for-duplicates-using-multiple-columns – jhanifen Oct 15 '13 at 02:55
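
A rough sketch of the three-table layout that comment describes (the table names, column types, and keys here are assumptions for illustration, not something given in the question):

CREATE TABLE english (
  id INT AUTO_INCREMENT PRIMARY KEY,
  sentence TEXT NOT NULL
);

CREATE TABLE jap (
  id INT AUTO_INCREMENT PRIMARY KEY,
  sentence TEXT NOT NULL
);

-- one row per pairing; the unique key stops the same pairing from being stored twice
CREATE TABLE translation (
  id INT AUTO_INCREMENT PRIMARY KEY,
  englishid INT NOT NULL,
  japid INT NOT NULL,
  UNIQUE KEY (englishid, japid),
  FOREIGN KEY (englishid) REFERENCES english(id),
  FOREIGN KEY (japid) REFERENCES jap(id)
);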

3 Answers

2

I think you just need to group by eng and jap:

SELECT eng, jap, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1

And if you want to remove all duplicates: if your rows have an id, this query shows all the ids that you have to keep:

select
  SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
from `eng-jap`
group by eng, jap

(it's a trick that uses GROUP_CONCAT to find the first id of every combination of eng/jap). And this query shows the ids of the rows you have to delete:

select `eng-jap`.id
from
  `eng-jap`
     left join
  (select
     SUBSTRING_INDEX(GROUP_CONCAT(CAST(id AS CHAR) order by id), ',', 1) as id
     from `eng-jap`
     group by eng, jap) `eng-jap-dup`
  on `eng-jap`.id = `eng-jap-dup`.id
where `eng-jap-dup`.id is null

I rewrote this query using just a join; it should be a little faster, but if your table is very big it is probably still slow.
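
To actually remove the extra rows, here is a minimal sketch of the corresponding DELETE (it assumes an `id` primary key and keeps the smallest id of every eng/jap combination, using MIN() instead of the GROUP_CONCAT trick):

-- keep the smallest id per (eng, jap) pair; delete everything else
delete `eng-jap`
from `eng-jap`
  left join
    (select min(id) as keep_id
     from `eng-jap`
     group by eng, jap) keepers
  on `eng-jap`.id = keepers.keep_id
where keepers.keep_id is null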

If it is still too slow and it still doesn't work, I would suggest adding two more columns to your table (a sketch of the ALTER statement follows the list):

  • eng-hash, where you can save MD5(eng)
  • jap-hash, where you can save MD5(jap)
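
A minimal sketch of adding those columns (the CHAR(32) type is an assumption, sized for MD5's 32-character hex output):

alter table `eng-jap`
  add column `eng-hash` char(32),
  add column `jap-hash` char(32);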

then update all of your records like this:

update `eng-jap` set `eng-jap`.`eng-hash` = MD5(eng), `eng-jap`.`jap-hash` = MD5(jap)

then you can add a unique index on the table on both columns, ignoring all errors, and let MySQL do the work of eliminating duplicates for you:

alter ignore table `eng-jap` add unique index (`eng-hash`, `jap-hash`);

(if you get an error while creating index, see this question: MySQL: ALTER IGNORE TABLE gives "Integrity constraint violation")
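
If ALTER IGNORE is not available to you (it was removed in later MySQL versions), a sketch of an alternative that relies only on the hash columns above: copy the data into a fresh table that already has the unique index and let INSERT IGNORE discard the duplicates (the `eng-jap-clean` name is just an example):

create table `eng-jap-clean` like `eng-jap`;
alter table `eng-jap-clean` add unique index (`eng-hash`, `jap-hash`);
insert ignore into `eng-jap-clean` select * from `eng-jap`;
-- once the copy looks right, swap the tables
rename table `eng-jap` to `eng-jap-old`, `eng-jap-clean` to `eng-jap`;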

fthiella
  • I'd like to add the index like you say, but I get the error "#1071 - Specified key was too long; max key length is 1000 bytes". Your select query works and is exactly what I asked for, though the second one to show the ids runs for a long time and then says aborted – user1397417 Nov 11 '12 at 10:26
  • I edited my answer... it's not difficult to find duplicates, but since your table looks big it's difficult to remove all duplicates except one... I gave you some suggestions, I could explain in more detail... – fthiella Nov 11 '12 at 13:42
1

You could temporarily add a column where you simply concatenate the value of both current columns. Then you make your query grouping by that column and eliminate matching rows. Afterwards you drop the column again.

The same can be done inside the query itself of course: just group by a synthesized result column that contains the concatenation of both physical columns.
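
For example, a minimal sketch of that in-query variant (the '|' separator is just an assumption; pick something that cannot occur in your sentences):

SELECT CONCAT(eng, '|', jap) AS pair, COUNT(*) AS c
FROM `eng-jap`
GROUP BY pair
HAVING c > 1;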

arkascha
  • I would like to do this but I can't think of a way to set the new column to be the eng + jap of its own row. – user1397417 Nov 11 '12 at 09:47
  • MySQL offers a convenient `CONCAT()` function for this. So you can simply do something like: `SELECT CONCAT(eng,'-',jap) as whatever...` and group by that column. That's all. http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_concat – arkascha Nov 11 '12 at 11:45
0

I'm not sure if you have already got answers here. I'm also not sure if you're trying to read only duplicated rows, or if you want a result set with no duplicates. It all depends on what the definition of a duplicate is for you.

1) Assuming you need only duplicated rows (as your title says):

a) If in example A there is a duplicate and in example B there are no duplicates, then just:

SELECT eng, jap, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap HAVING c > 1

b) If you find a duplicate even in example B (since jap part is the same), then:

SELECT   * 
FROM     (
          SELECT   * 
          FROM     (
                    SELECT   * 
                    FROM     `eng-jap` 
                    GROUP BY eng 
                    HAVING   COUNT(*) > 1

                    UNION ALL

                    SELECT   * 
                    FROM     `eng-jap` 
                    GROUP BY jap 
                    HAVING   COUNT(*) > 1
                   ) AS t
          GROUP BY eng
             ) AS v
GROUP BY jap

2) If you're looking for records with no duplicate rows (since you say at the end that you want to select all duplicates like example A so you can make a script to remove one of each):

a) If in example A there is a duplicate and in example B there are no duplicates, then just:

SELECT eng, jap, COUNT(*) c FROM `eng-jap` GROUP BY eng, jap

b) If you find a duplicate even in example B (since jap part is the same), then:

SELECT   * 
FROM     (SELECT c.eng, c.jap FROM `eng-jap` c GROUP BY c.eng) t 
GROUP BY t.jap
nawfal