
Confession: a MySQL newbie needs a simple example of how to locate duplicate rows in a somewhat large table. I have searched for and read many other threads with similar titles, but the examples are so complex that I cannot apply them to my basic situation.

A MySQL table has only 5 fields, but there are hundreds of rows. I wish to locate duplicate rows -- I know there is one for sure and wonder if there are others.

Example Row: (rel_id is auto-incrementing, primary key field)

'rel_id' => 1
'host' => 17
'host_type' => 'client'
'rep' => 7
'rep_type' => 'cli_mgr'

My approach was to:
1. Read entire table into mysql query
2. row-by-row compare the 4 data fields to those of previous ("done") rows
3. after comparing a "new" row, append it to array of "done" rows

Here is what I have tried. I am sure that there must be a much simpler solution. You will see that I am bogged down in trying to append the "new" row to the array of "done" rows:

$rRels = mysql_query("SELECT * FROM `rels`");
$a = array();   // rows already compared ("done" rows)
$e = array();   // duplicate rows found
$c1 = 0;
$c2 = 0;
while ($r = mysql_fetch_assoc($rRels)) {
    $i  = $r['rel_id'];
    $h  = $r['host'];
    $ht = $r['host_type'];
    $rp = $r['rep'];      // was `$r = $r['rep'];`, which clobbered the row array
    $rt = $r['rep_type'];

    foreach ($a as $row) {
        $xh  = $row['host'];
        $xht = $row['host_type'];
        $xr  = $row['rep'];       // keys must match those stored below ('rep', not 'rel')
        $xrt = $row['rep_type'];

        if (($h == $xh) && ($ht == $xht) && ($rp == $xr) && ($rt == $xrt)) {
            echo 'Found one<br>';
            $e[] = $r;            // keep the whole duplicate row
        }
        $c2++;
    }
    // append the current row to the "done" list
    // (array_merge flattened the row into $a instead of appending it)
    $a[] = array('rel_id' => $i, 'host' => $h, 'host_type' => $ht, 'rep' => $rp, 'rep_type' => $rt);
    $c1++;
}

echo '<h3>Duplicate Rows:</h3>';
foreach ($e as $row) {
    print_r($row);
    echo '<br>';
}
echo '<br><br>';
echo 'Counter 1: ' . $c1 . '<br>';
echo 'Counter 2: ' . $c2 . '<br>';
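For what it's worth, the row-by-row comparison above can be done in a single pass by keying a hash map on the four data fields, instead of rescanning every "done" row. A minimal sketch in Python (the sample rows are made up for illustration; in the real script they would come from the `rels` query):

```python
# Sketch: flag a row as a duplicate when its four data fields
# repeat a combination already seen.
rows = [
    {'rel_id': 1, 'host': 17, 'host_type': 'client', 'rep': 7, 'rep_type': 'cli_mgr'},
    {'rel_id': 2, 'host': 18, 'host_type': 'client', 'rep': 7, 'rep_type': 'cli_mgr'},
    {'rel_id': 3, 'host': 17, 'host_type': 'client', 'rep': 7, 'rep_type': 'cli_mgr'},  # dup of rel_id 1
]

seen = {}        # (host, host_type, rep, rep_type) -> first rel_id seen
duplicates = []  # rows whose data fields repeat an earlier row

for row in rows:
    key = (row['host'], row['host_type'], row['rep'], row['rep_type'])
    if key in seen:
        duplicates.append(row)
    else:
        seen[key] = row['rel_id']

print(duplicates)  # only the rel_id 3 row is flagged
```

The same idea works in PHP with an associative array keyed on the concatenated fields, but as the answers below show, the database can do this job in one query.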
cssyphus

4 Answers


This should do the trick:

SELECT COUNT(*) as cnt, GROUP_CONCAT(rel_id) AS ids
FROM rels
GROUP BY host, host_type, rep, rep_type
HAVING cnt > 1

Any "duplicate" group will have cnt > 1, and GROUP_CONCAT will give you the rel_ids of the duplicated records.
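You can try the query on a throwaway table before running it against your data. A quick demo using Python's sqlite3 with made-up rows (SQLite's group_concat() is close enough to MySQL's GROUP_CONCAT for this query):

```python
import sqlite3

# Demo of the GROUP BY / HAVING duplicate-finder on a toy table.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE rels (rel_id INTEGER PRIMARY KEY, "
            "host INT, host_type TEXT, rep INT, rep_type TEXT)")
con.executemany("INSERT INTO rels VALUES (?,?,?,?,?)", [
    (1, 17, 'client', 7, 'cli_mgr'),
    (2, 18, 'client', 7, 'cli_mgr'),
    (3, 17, 'client', 7, 'cli_mgr'),   # duplicate of rel_id 1
])

dupes = con.execute("""
    SELECT COUNT(*) AS cnt, GROUP_CONCAT(rel_id) AS ids
    FROM rels
    GROUP BY host, host_type, rep, rep_type
    HAVING cnt > 1
""").fetchall()
print(dupes)  # one group: cnt 2, ids "1,3"
```

Only the (17, 'client', 7, 'cli_mgr') group is reported; the rel_id 2 row differs in host, so it forms its own group of one and is filtered out by HAVING.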

Marc B

Pure no-PHP solution: make a copy of the old table (named oldTable), with no data:

create table newTable like oldTable;

Modify the structure to prevent duplicates by adding a unique key over all 5 columns:

alter table newTable add unique index(rel_id,host,host_type,rep,rep_type );

Then copy the rows from oldTable with an SQL query:

insert IGNORE into newTable select * from oldTable

In newTable you have only the unique data.
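The same approach can be sketched with Python's sqlite3 on made-up rows. Note two translations: SQLite spells INSERT IGNORE as INSERT OR IGNORE, and (as the comment thread below works out) rel_id is left out of the unique index, since the primary key already makes every row unique:

```python
import sqlite3

# Demo: copy rows into a table whose unique index silently drops duplicates.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE oldTable (rel_id INTEGER PRIMARY KEY, "
            "host INT, host_type TEXT, rep INT, rep_type TEXT)")
con.executemany("INSERT INTO oldTable VALUES (?,?,?,?,?)", [
    (1, 17, 'client', 7, 'cli_mgr'),
    (2, 18, 'client', 7, 'cli_mgr'),
    (3, 17, 'client', 7, 'cli_mgr'),   # duplicate of rel_id 1
])

con.execute("CREATE TABLE newTable (rel_id INTEGER PRIMARY KEY, "
            "host INT, host_type TEXT, rep INT, rep_type TEXT)")
# Unique index over the four data columns only (NOT rel_id).
con.execute("CREATE UNIQUE INDEX uq ON newTable (host, host_type, rep, rep_type)")
# SQLite's equivalent of MySQL's INSERT IGNORE.
con.execute("INSERT OR IGNORE INTO newTable SELECT * FROM oldTable ORDER BY rel_id")

kept = con.execute("SELECT rel_id FROM newTable ORDER BY rel_id").fetchall()
print(kept)  # [(1,), (2,)] -- rel_id 3 was dropped as a duplicate
```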

Another option is GROUP BY; to get the number of duplicate rows, use:

select  concat_ws('_',rel_id,host,host_type,rep,rep_type) as str, count(*) 
from oldTable 
group by str
Ladislav Zigo
  • I tried your solution but was unable to get it to add only unique rows. It just reproduced the old table, duplicate rows and all. – cssyphus Nov 07 '12 at 19:31
  • Query 2 (alter table...) included `rel_id`, which was a unique primary key. By omitting that field from the `add unique index` statement, I was able to create a second table without the duplicate rows. However, the rows were not entered in order! Do you know of any way to force them to be ORDERED as they are entered? Anyway, I wish there was a way to mark TWO answers correct! I will transfer to you the same number of points you would have received for a correct answer. – cssyphus Nov 07 '12 at 19:55
  • Change the second query to: `insert IGNORE into newTable select * from oldTable order by rel_id` – Ladislav Zigo Nov 07 '12 at 20:40
  • Thanks Lacoz. You know, that's exactly what I tried but it didn't work. The records were inputted in the same scrambled fashion. Here is the exact string that I used: `insert IGNORE into rels2 select * from rels ORDER BY rel_id` – cssyphus Nov 08 '12 at 19:26
  • Thank you, btw, for the updates on your answer. Please feel free to weigh-in on these two meta questions: [first](http://meta.stackexchange.com/questions/154971/rewarding-additional-answers) and [second](http://meta.stackexchange.com/questions/155078/bounty-system-discriminates-against-new-users#155080), since they concern you. – cssyphus Nov 08 '12 at 19:27

You can use this query to find all the duplicate rows. Hopefully it will be easy to integrate into your PHP code.

-- This will give you all the duplicates:
-- a self-join where all the data columns have the same values but different primary keys

SELECT * 
FROM   rels t1, rels t2
WHERE  t1.rel_id != t2.rel_id
AND    t1.host = t2.host
AND    t1.host_type = t2.host_type
AND    t1.rep = t2.rep
AND    t1.rep_type = t2.rep_type
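Run against toy data, each duplicate pair appears twice, once per ordering of t1 and t2 (adding `AND t1.rel_id < t2.rel_id` would keep each pair once). A quick check with Python's sqlite3, selecting just the two rel_ids for readability:

```python
import sqlite3

# Demo of the self-join duplicate-finder on made-up rows.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE rels (rel_id INTEGER PRIMARY KEY, "
            "host INT, host_type TEXT, rep INT, rep_type TEXT)")
con.executemany("INSERT INTO rels VALUES (?,?,?,?,?)", [
    (1, 17, 'client', 7, 'cli_mgr'),
    (2, 18, 'client', 7, 'cli_mgr'),
    (3, 17, 'client', 7, 'cli_mgr'),   # duplicate of rel_id 1
])

pairs = con.execute("""
    SELECT t1.rel_id, t2.rel_id
    FROM rels t1, rels t2
    WHERE t1.rel_id != t2.rel_id
    AND   t1.host = t2.host
    AND   t1.host_type = t2.host_type
    AND   t1.rep = t2.rep
    AND   t1.rep_type = t2.rep_type
""").fetchall()
print(pairs)  # rel_ids 1 and 3 are paired, in both orders
```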
Vaibhav Desai
  • This looks interesting. I saw this approach in other questions but did not understand the t1, t2 concept. Could you please explain, or direct to info page? – cssyphus Nov 07 '12 at 19:05
  • t1 and t2 are aliases for the table names. Usually they are used for convenience because you don't want to write full table names every time. But in this case, where we are performing a self-join, it becomes mandatory to distinguish the same column names from the same table. Here is a w3schools link: http://www.w3schools.com/sql/sql_alias.asp – Vaibhav Desai Nov 07 '12 at 19:26
  • Alternatively, to be more clear, you can say FROM rels AS t1, rels AS t2. It's the same way you would provide an alias for a column name: SELECT COUNT(1) AS count FROM someTable; – Vaibhav Desai Nov 07 '12 at 19:27

Finding duplicates is more easily done in SQL than in PHP.

SELECT GROUP_CONCAT(rel_id)
FROM rels
GROUP BY host, host_type, rep, rep_type HAVING COUNT(rel_id)>1;

This will show the groups of rel_id that point to identical records. The HAVING COUNT(rel_id)>1 clause allows us to skip unduplicated records.

dnagirl