0

I am currently working on a DataImport script, which is intended to move data from one database to another. The main problem I have come across is that the table in question contains a lot of duplicate records, with the duplicate fields being Product Code, Language, Legislation, Brand Name, Formula and Version, i.e we might have the following in the database:

My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 1 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 2 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 3 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 4 - not included in group by)

As you can see these records are identical in every way. My problem is, that as part of the data load script I wish to delete records with IDs of 1, 2 and 3 whilst retaining the record with the ID of 4 as this will be the most up-to-date record and hence the one I wish to keep. To do this, I have written a T-SQL script as follows:

-- get the list of items where there is at least one duplicate
DECLARE cDuplicateList CURSOR FOR
SELECT productcode, languageid, legislationid, brandName, versionnumber, formulaid
FROM allproducts
GROUP BY productcode, languageid, legislationid, brandName, versionnumber, formulaid
HAVING COUNT (*) > 1  

OPEN cDuplicateList

FETCH cDuplicateList INTO @productCode, @languageId, @legislationId, @brandName, @versionNumber, @formulaId

-- while there are still duplicates
WHILE @@FETCH_STATUS=0
BEGIN

-- delete from the table where the product ID is in the sub-query, which contains all
-- of the records apart from the last one
DELETE FROM AllProducts 
WHERE productId IN
(
    SELECT productId
    FROM allProducts
    WHERE productCode = @productCode 
        AND (languageId = @languageId OR @languageId IS NULL) 
        AND (legislationId = @legislationId OR @legislationId IS NULL)
        AND (brandName = @brandName OR @brandName IS NULL)
        AND (versionNumber = @versionNumber OR @versionNumber IS NULL)
        AND (formulaId = @formulaId OR @formulaId IS NULL)
    EXCEPT
    SELECT TOP 1 productId
    FROM allProducts
    WHERE productCode = @productCode 
        AND (languageId = @languageId OR @languageId IS NULL) 
        AND (legislationId = @legislationId OR @legislationId IS NULL)
        AND (brandName = @brandName OR @brandName IS NULL)
        AND (versionNumber = @versionNumber OR @versionNumber IS NULL)
        AND (formulaId = @formulaId OR @formulaId IS NULL)
)

FETCH cDuplicateList INTO @productCode, @languageId, @legislationId, @brandName, @versionNumber, @formulaId

END

Now, this does work - its just incredibly slow and I can't think of any easy way to make it faster. Does anyone have any ideas how I can maintain the same functionality but make it run faster?

Andrew
  • 2,315
  • 3
  • 27
  • 42
  • possible duplicate of [How can I remove duplicate rows?](http://stackoverflow.com/questions/18932/how-can-i-remove-duplicate-rows) – Martin Smith Oct 08 '13 at 11:02

3 Answers3

3

You can already use a common-table-expression and ROW_NUMBER in SQL_Server 2005:

WITH CTE AS
(
    SELECT ProductCode, Language, Legislation, BrandName, Formula, Version,
       RN = ROW_NUMBER() 
                   OVER ( 
                     PARTITION BY productcode, language, legislation, brandname, formula, version 
                     ORDER BY id DESC) 
    FROM dbo.Students
)
DELETE FROM CTE WHERE RN > 1

Change DELETE to SELECT * if you want to see what you'll delete.

Tim Schmelter
  • 450,073
  • 74
  • 686
  • 939
0

You could use row_number() over (partition by productcode, languageid, legislationid, brandName, versionnumber, formulaid order by productId desc ) and delete all producitids with row_number > 1

tjibbe chris
  • 160
  • 1
  • 3
0

Assuming your productId column is a unique ID:

delete  p1
from    AllProducts p1
join    AllProducts p2
on      p1.languageId = p2.languageId
and     p1.legislationId = p2.legislationId
and     p1.brandName = p2.brandName
and     p1.versionNumber = p2.versionNumber
and     p1.formulaId = p2.formulaId
and     p1.productId < p2.productId

This will delete all put the latest records where there are matches.

If you want to delete records that match on some criteria (for example, brand name and version number only, then remove the other clauses from the join).

Mathew Collins
  • 376
  • 3
  • 14