Interview - Detect/remove duplicate entries

Question

how to detect/remove duplicate entries from a database in a table where there is no primary key ?

[If we use 'DISTINCT' how do we know which record is the correct one and duplicate one ? ]

What do you mean by "duplicate one"? Shouldn't they both have the same values for all of the fields? Otherwise, it wouldn't really be "duplicate" — JosephH, Oct 04 '11 at 14:33
lets think that few columns are equal and some are different — Sudantha, Oct 04 '11 at 14:38
"lets think that few columns are equal and some are different" in that case they are distinct rows. In any case, look at http://stackoverflow.com/questions/18932/sql-how-can-i-remove-duplicate-rows — secreteyes, Oct 04 '11 at 22:11

score 3 · Accepted Answer · answered Oct 05 '11 at 03:57

delete f
from
(
    select ROW_NUMBER() 
        over (partition by 
            YourFirstPossibleDuplicateField,
            YourSecondPossibleDuplicateField
            order by WhateverFieldYouWantSortedBy) as DelId
    from YourTable
) as f
where DelId > 1

Dave · Answer 2 · 2011-10-04T22:57:29.613

I created a view where DISTINCT actually was not a part of the query, but PARTITION. I needed the most recent entry to records with the same Ordernum and RecordType fields, discarding the others. The partitions are ordered by date, and then the top row is selected, like this:

SELECT *, ROW_NUMBER() 
OVER (PARTITION BY OrderNum, RecordType ORDER BY DateChanged DESC) rn
FROM HistoryTable SELECT * FROM q WHERE rn = 1

score 0 · Answer 3 · answered Oct 05 '11 at 07:43

If we use 'DISTINCT' how do we know which record is the correct one and duplicate one?

If you have duplicate rows then doesn't matter which duplicate is picked because they are all the same!

I guess when you say "there is no primary key" that you actually mean there is no simple single-column 'surrogate' candidate key such as an incrementing sequence of integers, preferably with no gaps, but that there is a multi-column compound 'natural' candidate key (though does not comprise all the columns).

If this is the case, you'd look for something to break ties e.g. a column named DateChanged as per @Dave's answer. Otherwise, you need to pick am arbitrary row e.g. the answer by @Surfer513 does this using the ROW_NUMBER() windowed function over (YourFirstPossibleDuplicateField, YourSecondPossibleDuplicateField) (i.e. your natural key) then picking the duplicate that got arbitrarily assigned the row number 1.

Interview - Detect/remove duplicate entries

3 Answers3

Linked

Related