0

I wanted some help in regards to understanding how I can delete duplicate records from my database table. I have a table of 1 million records which has been collected over a 2 year period hence there is a number of records that need to be deleted as they have been added numerous times into the database.

The following is a query that I wrote based on the three columns that I am matching for duplicates, taking a count and I have also added a length of one of the columns as this will determine whether I delete all the records or just the duplicates.

SELECT
    Ref_No, 
    End_Date, 
    Filename,
    count(*) as cnt,
    length(Ref_No)
FROM
    master_table
GROUP BY
    Ref_No, 
    End_Date, 
    Filename,
    length(Ref_No)
HAVING
    COUNT(*) > 1
    ;

This then gives me an output like the following:

Ref_No | End_Date | Filename | cnt | length(Ref_No)
05011384 | 2018-07-01 | File1 | 2 | 8
1234 | 2018-12-31 File2 | 11 | 4
1000002975625 | 2018-12-31 | File3 | 13
123456789123456789 | 2019-02-06 | File3 | 18

Now I have a list of rules to follow based on the length column and this will determine whether I leave the records as they are with the duplicates, delete the duplicates or delete all the records and this is where I am stuck.

My rules are the following:

If length is between 0 and 4 - Keep all records with duplicates

If length is between 5 and 10 - Delete Duplicates, keep 1 record

If length equals 13 - Delete Duplicates, keep 1 record

If length is 11, 12, 14-30 - Delete all records

I would really appreciate if some could advice on how I go about completing this task.

Thanks.

Michael Owen
  • 365
  • 3
  • 20
  • Is there a primary key in the table, so you can distinguish the "duplicate" rows? – Paul Spiegel Feb 25 '20 at 17:53
  • `SELECT ROW_NUMBER, * FROM master_Table WHERE((length(Ref_No) between 5 and 10 or length(Ref_no)=13) and row_number>1) or (length(Ref_no) in (11,12) or length(Ref_no)>=14` This will select all records you think can be deleted, To get ROW_NUMBER in MySQL see: https://stackoverflow.com/questions/1895110/row-number-in-mysql – Luuk Feb 25 '20 at 17:58
  • Hi, Paul, there is a primary key which is an ID column. I also have the End Date column which is indexed as well. – Michael Owen Feb 25 '20 at 18:30
  • Hi Luuk, tried the query, but I am guessing as i have version 5.7, ROW_NUMBER does not work. Will go through the link you mentioned. – Michael Owen Feb 26 '20 at 10:08

1 Answers1

0

I have managed to create a temporary table in which I add a unique id. The only thing is that I am running the query twice with the length part changed for my requirements.

INSERT INTO UniqueIDs
    (
    SELECT
        T1.ID
    FROM
        master_table T1
    LEFT JOIN
        master_table T2
    ON
    (
        T1.Ref_No = T2.Ref_No
    AND
        T1.End_Date = T2.End_Date
    AND
        T1.Filename = T2.Filename
    AND
        T1.ID > T2.ID
    )
    WHERE T2.ID IS NULL
    AND
        LENGTH(T1.Ref_No) BETWEEN 5 AND 10
    )
    ;

I then just run the following delete to keep the unique ids in the table and remove the rest.

DELETE FROM master_table WHERE id NOT IN (SELECT ID FROM UniqueIDs);

That's it.

Michael Owen
  • 365
  • 3
  • 20