I wanted some help in regards to understanding how I can delete duplicate records from my database table. I have a table of 1 million records which has been collected over a 2 year period hence there is a number of records that need to be deleted as they have been added numerous times into the database.
The following is a query that I wrote based on the three columns that I am matching for duplicates, taking a count and I have also added a length of one of the columns as this will determine whether I delete all the records or just the duplicates.
SELECT
Ref_No,
End_Date,
Filename,
count(*) as cnt,
length(Ref_No)
FROM
master_table
GROUP BY
Ref_No,
End_Date,
Filename,
length(Ref_No)
HAVING
COUNT(*) > 1
;
This then gives me an output like the following:
Ref_No | End_Date | Filename | cnt | length(Ref_No)
05011384 | 2018-07-01 | File1 | 2 | 8
1234 | 2018-12-31 File2 | 11 | 4
1000002975625 | 2018-12-31 | File3 | 13
123456789123456789 | 2019-02-06 | File3 | 18
Now I have a list of rules to follow based on the length column and this will determine whether I leave the records as they are with the duplicates, delete the duplicates or delete all the records and this is where I am stuck.
My rules are the following:
If length is between 0 and 4 - Keep all records with duplicates
If length is between 5 and 10 - Delete Duplicates, keep 1 record
If length equals 13 - Delete Duplicates, keep 1 record
If length is 11, 12, 14-30 - Delete all records
I would really appreciate if some could advice on how I go about completing this task.
Thanks.