I am currently working on a DataImport script, which is intended to move data from one database to another. The main problem I have come across is that the table in question contains a lot of duplicate records, with the duplicate fields being Product Code, Language, Legislation, Brand Name, Formula and Version, i.e we might have the following in the database:
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 1 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 2 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 3 - not included in group by)
My Test Product, English, UK, Test Brand, Test Formula, 1 (ID 4 - not included in group by)
As you can see these records are identical in every way. My problem is, that as part of the data load script I wish to delete records with IDs of 1, 2 and 3 whilst retaining the record with the ID of 4 as this will be the most up-to-date record and hence the one I wish to keep. To do this, I have written a T-SQL script as follows:
-- get the list of items where there is at least one duplicate
DECLARE cDuplicateList CURSOR FOR
SELECT productcode, languageid, legislationid, brandName, versionnumber, formulaid
FROM allproducts
GROUP BY productcode, languageid, legislationid, brandName, versionnumber, formulaid
HAVING COUNT (*) > 1
OPEN cDuplicateList
FETCH cDuplicateList INTO @productCode, @languageId, @legislationId, @brandName, @versionNumber, @formulaId
-- while there are still duplicates
WHILE @@FETCH_STATUS=0
BEGIN
-- delete from the table where the product ID is in the sub-query, which contains all
-- of the records apart from the last one
DELETE FROM AllProducts
WHERE productId IN
(
SELECT productId
FROM allProducts
WHERE productCode = @productCode
AND (languageId = @languageId OR @languageId IS NULL)
AND (legislationId = @legislationId OR @legislationId IS NULL)
AND (brandName = @brandName OR @brandName IS NULL)
AND (versionNumber = @versionNumber OR @versionNumber IS NULL)
AND (formulaId = @formulaId OR @formulaId IS NULL)
EXCEPT
SELECT TOP 1 productId
FROM allProducts
WHERE productCode = @productCode
AND (languageId = @languageId OR @languageId IS NULL)
AND (legislationId = @legislationId OR @legislationId IS NULL)
AND (brandName = @brandName OR @brandName IS NULL)
AND (versionNumber = @versionNumber OR @versionNumber IS NULL)
AND (formulaId = @formulaId OR @formulaId IS NULL)
)
FETCH cDuplicateList INTO @productCode, @languageId, @legislationId, @brandName, @versionNumber, @formulaId
END
Now, this does work - its just incredibly slow and I can't think of any easy way to make it faster. Does anyone have any ideas how I can maintain the same functionality but make it run faster?