Deleting duplicate data in MySQL

Question

I'm trying to emulate the accepted answer in this SO question: Delete all Duplicate Rows except for One in MySQL? [duplicate] with a twist, I want the data (auto-incrementing ID's) of one table to determine which rows to delete in another table. SQLFiddle here showing data.

In the fiddle referenced above, the end result I'm looking for is the rows in eventdetails_new with Event_ID = 4 & 6 to be deleted (EVENTDETAILS_ID's 5 & 6, and 9 & 10), leaving rows 3 & 5 (EVENTDETAILS_ID's 3 & 4 and 7 & 8). I hope that made sense. Ideally the rows in events_new with those same Event_ID's would get deleted as well (which I haven't started working on yet, so no code samples).

This is the query I'm trying to make work, but I'm a bit over my head:

SELECT *
FROM eventdetails_new AS EDN1, eventdetails_new AS EDN2
INNER JOIN events_new AS E1 ON `E1`.`Event_ID` = `EDN1`.`Event_ID`
INNER JOIN events_new AS E2 ON `E2`.`Event_ID` = `EDN2`.`Event_ID`
WHERE `E1`.`Event_ID` > `E2`.`Event_ID` 
AND `E1`.`DateTime` = `E2`.`DateTime`
AND events_new.EventType_ID = 6;

Here's the same SQLFiddle with the results of this query. Not good. I can see the Event_ID in the data, but the query cannot for some reason. Not sure how to proceed to fix this.

I know it's a SELECT query, but I couldn't figure out a way to have two aliased tables in the DELETE query (which I think I need?). I figured if I could get a selection, I could delete it with some C# code. However ideally it could all be done in a single query or set of statements without having to go outside of MySQL.

Here's my first cut at the query, but it's just as bad:

DELETE e1 FROM eventdetails_new e1 
WHERE `events_new`.`Event_ID` > `events_new`.`Event_ID` 
AND events_new.DateTime = events_new.DateTime AND events_new.EventType_ID = 6;

SQLFiddle won't let me run this query at all, so it's not much help. However, it give me the same error as the one above: Error Code: 1054. Unknown column 'events_new.Event_ID' in 'where clause'

I'm by no means married to either of these queries if there's a better way. The end result I'm looking for is deleting a bunch of duplicate data.

I have hundreds of thousands of these results, and I know that roughly 1/3 of them are duplicates that I need to get rid of before we go live with the database.

How about a different approach: determine `SELECT` query to get the end result you are looking for using `DISTINCT` and/or `GROUP BY` to filter out duplicates. Then load the results into a temporary table. `TRUNCATE` your existing table (backup first!), then load with the results from the `SELECT`. I find it easier to get my head around selecting unique rows, then deleting duplicates. Not sure if this is an option so haven't posted as answer. — adey_888, Sep 12 '14 at 01:08
I'm open to any method to get this done. Worst comes to worst, I can recreate the entire database from logs in about 5 hours and we're still in the "testing" phase (my company doesn't exactly understand the concept, although I'm digging my heels in on this). If you can provide a code example that'd be outstanding. The SQLFiddle I provided is a good example of what I'm looking at on an extremely small scale. — delliottg, Sep 12 '14 at 03:07

score 0 · Accepted Answer · answered Sep 17 '14 at 17:53

Here's what I eventually ended up doing. My co-worker & I came up with a query that would give us a list of Event_ID's that had duplicate data (we actually used Access 2010's query builder and MySQL-ified it). Bear in mind this is a complete solution where the original question didn't have as much detail as far as linked tables. If you've got questions about this, feel free to ask & I'll try to help:

SELECT `Events_new`.`Event_ID`
    FROM Events_new
    GROUP BY `Events_new`.`PCBID`, `Events_new`.`EventType_ID`, `Events_new`.`DateTime`, `Events_new`.`User`
    HAVING (((COUNT(`Events_new`.`PCBID`)) > 1) AND ((COUNT(`Events_new`.`User`)) > 1) AND ((COUNT(`Events_new`.`DateTime`)) > 1))

From this I processed each Event_ID to remove the duplicates in an iterative manner. Basically I had to delete all the child rows starting from the last lowest table so that I didn't run afoul of foreign key restraints.

This chunk of code was written in LinqPAD as C# statements: (sbCommonFunctions is an inhouse DLL designed to make most (but not all as you'll see) database functions be handled the same way or easier)

sbCommonFunctions.Database testDB = new sbCommonFunctions.Database();
testDB.Connect("production", "database", "user", "password");
List<string> listEventIDs = new List<string>();
List<string> listEventDetailIDs = new List<string>();
List<string> listTestInformationIDs = new List<string>();
List<string> listTestStepIDs = new List<string>();
List<string> listMeasurementIDs = new List<string>();
string dtQuery = (String.Format(@"SELECT `Events_new`.`Event_ID`
        FROM Events_new
        GROUP BY `Events_new`.`PCBID`,
        `Events_new`.`EventType_ID`,
        `Events_new`.`DateTime`,
        `Events_new`.`User`
        HAVING (((COUNT(`Events_new`.`PCBID`)) > 1)
        AND ((COUNT(`Events_new`.`User`)) > 1) 
        AND ((COUNT(`Events_new`.`DateTime`)) > 1))"));

int iterations = 0;
DataTable dtEventIDs = getDT(dtQuery, testDB);
while (dtEventIDs.Rows.Count > 0)
{
    Console.WriteLine(dtEventIDs.Rows.Count);
    Console.WriteLine(iterations);
    iterations++;
    foreach(DataRowView eventID in dtEventIDs.DefaultView)
    {
        listEventIDs.Add(eventID.Row[0].ToString());
        DataTable dtEventDetails = testDB.QueryDatabase(String.Format(
        "SELECT * FROM EventDetails_new WHERE Event_ID = {0}",
           eventID.Row[0]));
        foreach(DataRowView drvEventDetail in dtEventDetails.DefaultView)
        {
            listEventDetailIDs.Add(drvEventDetail.Row[0].ToString());
        }   
        DataTable dtTestInformation = testDB.QueryDatabase(String.Format(
        @"SELECT TestInformation_ID 
        FROM TestInformation_new 
        WHERE Event_ID = {0}",
            eventID.Row[0]));
        foreach(DataRowView drvTest in dtTestInformation.DefaultView)
        {
            listTestInformationIDs.Add(drvTest.Row[0].ToString());
            DataTable dtTestSteps = testDB.QueryDatabase(String.Format(
            @"SELECT TestSteps_ID 
            FROM TestSteps_new 
            WHERE TestInformation_TestInformation_ID = {0}",
               drvTest.Row[0]));
            foreach(DataRowView drvTestStep in dtTestSteps.DefaultView)
            {
                listTestStepIDs.Add(drvTestStep.Row[0].ToString());
                DataTable dtMeasurements = testDB.QueryDatabase(String.Format(
                @"SELECT Measurements_ID 
                FROM Measurements_new 
                WHERE TestSteps_TestSteps_ID = {0}",
                   drvTestStep.Row[0]));
                foreach(DataRowView drvMeasurements in dtMeasurements.DefaultView)
                {
                    listMeasurementIDs.Add(drvMeasurements.Row[0].ToString());
                }
            }
        }
    }
    testDB.Disconnect();
    string mysqlConnection = 
    "server=server;\ndatabase=database;\npassword=password;\nUser ID=user;";
    MySqlConnection connection = new MySqlConnection(mysqlConnection);
    connection.Open();
    //start unwinding the duplicates from the lowest level upward
    whackDuplicates(listMeasurementIDs, "measurements_new", "Measurements_ID", connection);
    whackDuplicates(listTestStepIDs, "teststeps_new", "TestSteps_ID", connection);
    whackDuplicates(listTestInformationIDs, "testinformation_new", "testInformation_ID", connection);
    whackDuplicates(listEventDetailIDs, "eventdetails_new", "eventdetails_ID", connection);
    whackDuplicates(listEventIDs, "events_new", "event_ID", connection);
    connection.Close();
    //update iterator from inside the clause in case there are more duplicates.
    dtEventIDs = getDT(dtQuery, testDB);    }

}//goofy curly brace to allow LinqPAD to deal with inline classes
public void whackDuplicates(List<string> listOfIDs,
                            string table,
                            string pkID, 
                            MySqlConnection connection)
{
    foreach(string ID in listOfIDs)
    {
        MySqlCommand command = connection.CreateCommand();
        command.CommandText = String.Format(
        "DELETE FROM " + table + " WHERE " + pkID + " = {0}", ID);
        command.ExecuteNonQuery();
    }
}
public DataTable getDT(string query, sbCommonFunctions.Database db)
{
    return db.QueryDatabase(query);
//}/*this is deliberate, LinqPAD has a weird way of dealing with inline
     classes and the last one can't have a closing curly brace (and the 
     first one has to have an extra opening curly brace above it, go figure)
   */

Basically this is a giant while loop, and the clause iterator is updated from inside the clause until the number of Event_ID's drops to zero (it takes 5 iterations, some of the data has as many as six duplicates).

Deleting duplicate data in MySQL

1 Answers1