Filtering duplicates in a table without a primary key

Question

I would really appreciate your help with the following problem:

I have a table (stock_header) which has 3 columns of interest

catalogue_code  purcha  stock_
--------------  ------  ------
1001            Box     Box
1001            Box     Box
1002            EA      EA
1002            Set     Set
1002            Kit     Kit
1004            Set     Set

I would like to extract the information using the following rules:

If a catalogue_code is a duplicate and all its purcha and stock_ attributes have the same value (e.g catalogue_code 1001), select any record.
If a catalogue_code is a duplicate and some of its purcha and stock_ attributes contain different values (e.g catalogue_code 1002), select the record that has purcha = 'EA', stock_ = 'EA'.

I am having difficulties implementing this logic in T-SQL.

Update: Just wanted to mention that the original table doesn't have a primary key. The removal of duplicates is done so that the catalogue_code can be the primary key.

Update2: Unfortunately, answers by Nick and Elian do not solve the problem. Here is my pseudo code for a potential implementation:

Find all the duplicated catalogue_codes
For each duplicate do:
Get any pair of stock_ and purcha attributes. Compare the selected pair to all other pairs and store your results in a variable (numberOfIdenticalRecords).
If numberOfIdenticalRecords = 1 (meaning that all the other records have distinct stock_ and purcha values). In this case, select the record that has purcha = 'EA' and stock_ = 'EA'.
Else if numberOfIdenticalRecords > 1 (meaning that all the records contain the same stock_ and purcha values), select any record.

One big drawback of this implementation is that it is procedural whereas SQL is declarative. Even if it's possible to implement it, chances are that it's going to be hard to understand and maintain. Am I over thinking the whole thing? Is there a simpler way of doing it?

You will have a hard time deleting duplicates if *all* columns are identical. Is there an additional primary key column? The full table definition would help. — Erwin Brandstetter, Oct 08 '11 at 02:04
What should be selected if a multiple rows have the same `catalogue_code`, but not all the same and without `EA` in `purcha` and `stock_`? For example, if there was another row for `1004` with `Kit` / `Kit`, which should be selected? — Jeff Ogata, Oct 08 '11 at 02:12
If you want to remove the duplicate data, [this CTE approach](http://stackoverflow.com/questions/18932/sql-how-can-i-remove-duplicate-rows/3822833#3822833) works best, whether or not all the columns are identical. I would then impose a constraint on the table so the duplicate data problem doesn't come back. — Nick Chammas, Oct 08 '11 at 02:14
@adrift Good point. For the moment I am assuming that such records do not exist. — Sergey Koulikov, Oct 08 '11 at 02:26
@Erwin My apologies, the title of the question wasn't clear. I only need to filter off the duplicates so that I can join this result set to another table in which catalogue_code is the primary key. — Sergey Koulikov, Oct 08 '11 at 02:45

Elian Ebbing · Answer 1 · 2011-10-08T03:07:13.350

2

In the title you mention removing rows, but your question sounds like you want a select query. I must say that your requirements sound a bit strange, but I guess the following query will give you what you are looking for:

select 
    catalogue_code, purcha, stock_
from 
    stock_header
where 
    (purcha = 'EA' and stock_ = 'EA') 
    or catalogue_code not in(select catalogue_code from stock_header where purcha = 'EA' AND stock_ = 'EA')
group by 
    catalogue_code, purcha, stock_

If you really need to delete rows, then you need something to uniquely identify a row, like Erwin Brandstetter mentioned in his comment.

edited Oct 08 '11 at 03:07

answered Oct 08 '11 at 02:09

Elian Ebbing

18,779
5
48
56

Thank you for your answer. I will update the title. Yes, a select query is what I am after. The requirement might sound a bit strange. This data comes from a legacy inventory and procurement system and it's to be imported into an IBM asset management system. It has been a painful task so far. – Sergey Koulikov Oct 08 '11 at 02:14
3

Actually, you don't need to uniquely identify a row to remove duplicate data if you are on SQL Server 2005 or later. You can use [this CTE approach](http://stackoverflow.com/questions/18932/sql-how-can-i-remove-duplicate-rows/3822833#3822833) to remove duplicate data, regardless of your schema. – Nick Chammas Oct 08 '11 at 02:54
@Elian I have executed your query and it filters off 425 records. The only problem is that there are only 408 duplicated records in the table. – Sergey Koulikov Oct 08 '11 at 02:54
Query works as intended: http://sqlize.com/t67uWlM2F2 Well done! FYI, just a typo in "stock" :) – Mosty Mostacho Oct 08 '11 at 03:02
@Mosty - Yeah, I didn't test my query, so I was bound to make a typo :) cool website btw. – Elian Ebbing Oct 08 '11 at 03:06
@Mosty I am not saying that the query doesn't work. What I don't understand is when I run it against the real table it filters off more records than I expect it to. Which means that I am missing something here. – Sergey Koulikov Oct 08 '11 at 03:09

score 2 · Accepted Answer · edited Jan 18 '21 at 12:30

2

declare @T table
(
  catalogue_code int,
  purcha varchar(5),
  stock_ varchar(5)
)

insert into @T values
(1001,            'Box',     'Box'),
(1001,            'Box',     'Box'),
(1002,            'AA',      'AA'),
(1002,            'MM',      'MM'),
(1002,            'EA',      'EA'),
(1002,            'Set',     'Set'),
(1002,            'Kit',     'Kit'),
(1004,            'Set',     'Set')

;with C as
(
  select *,
         row_number() over(partition by catalogue_code 
                           order by case when purcha = 'EA'
                                         then 0
                                         else 1
                                    end) as rn
  from @T
)
select *
from C
where rn = 1

Result:

catalogue_code purcha stock_ rn
-------------- ------ ------ --------------------
1001           Box    Box    1
1002           EA     EA     1
1004           Set    Set    1

Try it on SE-Data Explorer: https://data.stackexchange.com/stackoverflow/q/114648/

edited Jan 18 '21 at 12:30

Community

1
1

answered Oct 09 '11 at 04:09

Mikael Eriksson

136,425
22
210
281

Thank you. After reading your solution I realized that my explanation is still not as clear as it should have been. All the records/tuples always have purcha = stock_. I don't have records like (1001, Box1, Box) but (1001, Box, Box). I understand how to filter out duplicates that don't pass constraints I impose on their attributes. But in this case, comparing purcha and stock_ attributes is not helpful. Perhaps I should have written "If a catalogue_code is a duplicate and all its purcha and stock_ attributes have the same value across all records"? – Sergey Koulikov Oct 09 '11 at 04:56
@Sergey - Changed the indata. That does not affect the solution but only makes it simpler. Or is this not the desired result with the given input? – Mikael Eriksson Oct 09 '11 at 04:58
@ Mikael It's a row comparison I am after, not attribute comparison. If I wanted to compare attributes, then your solution, Nick's solution and Elian's solutions are fine. – Sergey Koulikov Oct 09 '11 at 04:59
I would like to choose a record, say (1001, EA, EA) and compare it to, say (1001, Box, Box), (1001, Kit, Kit), (1001, Msg, Msg), etc. – Sergey Koulikov Oct 09 '11 at 05:01
@Sergey - What is the desired result with the given input? – Mikael Eriksson Oct 09 '11 at 05:03
Your result is correct with the input that you have provided. But when I run this query against the real dataset, the query filters off more records than it should. There are only 408 duplicate records in the table. Your (and Elian's) query removes 425 records. – Sergey Koulikov Oct 09 '11 at 05:11
1

My apologies, Mikael. I double checked the results of your query and all the right records are present. Your solution is correct. – Sergey Koulikov Oct 09 '11 at 07:42

Filtering duplicates in a table without a primary key

2 Answers2

Linked