2

I've inherited a SQL Server database that has duplicate data in it. I need to find and remove the duplicate rows. But without an id field, I'm not sure how to find the rows.

Normally, I'd compare it with itself using a LEFT JOIN and check that all fields are the same except the ID field would be table1.id <> table2.id, but without that, I don't know how to find duplicates rows and not have it also match on itself.

TABLE:

productId int not null,
categoryId int not null,
state varchar(255) not null,
dateDone DATETIME not null

SAMPLE DATA

1, 3, "started", "2016-06-15 04:23:12.000"
2, 3, "started", "2016-06-15 04:21:12.000"
1, 3, "started", "2016-06-15 04:23:12.000"
1, 3, "done", "2016-06-15 04:23:12.000"

In that sample, only rows 1 and 3 are duplicates.

How do I find duplicates?

Don Rhummy
  • 24,730
  • 42
  • 175
  • 330
  • 1
    You can use a cte add a `row_number()` with partition and then delete where row_num > 1 I will work on example – Matt Jun 16 '16 at 20:12
  • @Matt Can you give an example? – Don Rhummy Jun 16 '16 at 20:14
  • All there for you, your test data, select before and after the delete. One a side note if you are able to change the schema you can always add an identity column later. – Matt Jun 16 '16 at 20:18
  • 1
    Possible duplicate of [How to delete duplicate rows in sql server?](http://stackoverflow.com/questions/18390574/how-to-delete-duplicate-rows-in-sql-server) – Juan Carlos Oropeza Jun 16 '16 at 20:25

4 Answers4

8

Use having (and group by)

select 
    productId 
  , categoryId 
  , state
  , dateDone
  , count(*)
from your_table 
group by productId ,categoryId ,state, dateDone
having count(*) >1
ScaisEdge
  • 131,976
  • 10
  • 91
  • 107
  • 1
    This query didn't work. It gives lots of results that aren't duplicates – Don Rhummy Jun 16 '16 at 21:14
  • I have update the answer with the cont(*) . so you can check if the grouped row are duplicated or not .. otherwise . .. let me see a proper sample .. and i check for the right result . – ScaisEdge Jun 16 '16 at 21:21
1

For some reason I thought you wanted to delete them I guess I read that wrong but just switch DELETE in my statement to SELECT and now you have all of the duplicates and not the original. But using DELETE will remove all duplicates and still leave you 1 record which I suspect is your desire.

IF OBJECT_ID('tempdb..#TT') IS NOT NULL
    BEGIN
        DROP TABLE #TT
    END

CREATE TABLE #TT (
    productId int not null,
    categoryId int not null,
    state varchar(255) not null,
    dateDone DATETIME not null
)

INSERT INTO #TT (productId, categoryId, state, dateDone)
VALUES (1, 3, 'started', '2016-06-15 04:23:12.000')
,(2, 3, 'started', '2016-06-15 04:21:12.000')
,(1, 3, 'started', '2016-06-15 04:23:12.000')
,(1, 3, 'done', '2016-06-15 04:23:12.000')


SELECT *
FROM
    #TT

;WITH cte AS (
    SELECT
       *
       ,RowNum = ROW_NUMBER() OVER (PARTITION BY productId, categoryId, state, dateDone ORDER BY productId) --note what you order by doesn't matter

    FROM

           #TT
    )

--if you want to delete them just do this otherwise change DELETE TO SELECT
    DELETE
    FROM
        cte
    WHERE
        RowNum > 1

    SELECT *
    FROM
        #TT

If you want to and can change schema you can always add an identity column after the fact too and it will populate the existing record

ALTER TABLE #TT
ADD Id INTEGER IDENTITY(1,1) NOT NULL
Matt
  • 13,833
  • 2
  • 16
  • 28
  • Do you have a SQL fiddle for this? I couldn't get it working – Don Rhummy Jun 16 '16 at 22:44
  • What DB are you using? you taged sql-server, sql-server2008, 2012, 2005, etc. SQL Fiddle is MySQL... If you copy everything in the first box and paste in SQL 2012 (which was tagged) SSMS it should work for you and the I am pretty sure the features I am using go back to 2005, I wrote it in SSMS on 2014. http://sqlmag.com/blog/window-functions-over-clause-help-make-difference – Matt Jun 16 '16 at 22:52
  • SQL Fiddle can also do SQL Server (it's in the upper left corner to change it). They have these databases for a few different customers and they're all on different versions of SQL Server. That's why I tagged it that way. – Don Rhummy Jun 16 '16 at 23:53
  • @DonRhummy do you not have access to SQL Server SSMS? I just got on SQLFiddle and even their own examples won't load for sql-server 2014 or 2008. "SELECT 1 as 'a' won't even execute on sqlfiddle right now. My script works in SQL-server. 2 of the other answers after mine are using the same technique just not with the delete... I would have assumed you have access to SQL 2005 or newer somewhere no? If so copy and paste my code to ssms and it should run. If not I guess I am not sure why I am answering the question. – Matt Jun 17 '16 at 00:21
  • I will have access, but not till tomorrow. Was hoping to have something to test before then. No worries. Will do tomorrow. – Don Rhummy Jun 17 '16 at 02:23
  • I understand, I am confident it will work for you. I remember the first time I saw the technique to delete from a cte from a dba of a vendor that works with my company I was excited! have a good night – Matt Jun 17 '16 at 04:08
1

You can do this with windowing functions. For instance

create table #tmp
   (
        Id INT
   )


insert into #tmp
VALUES (1), (1), (2) --so now we have duplicated rows



WITH CTE AS 
    (
     SELECT 
       ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Id) AS [DuplicateCounter], 
       Id
     FROM #tmp
    )
DELETE FROM CTE
WHERE DuplicateCounter > 1 --duplicated rows have DuplicateCounter > 1
Deealga
  • 11
  • 1
0

You can try CTE and then limit the actual selection from the CTE to where RN = 1. Here is the query:-

;WITH ACTE 
AS 
(
    SELECT ProductID, categoryID, State, DateDone,
    RN = ROW_NUMBER() OVER(PARTITION BY ProductID, CategoryID, State, DateDone 
                            ORDER BY ProductID, CategoryID, State, DateDone) 
    FROM [Table] 
 ) 

SELECT * FROM ACTE WHERE RN = 1    
Mahedi Sabuj
  • 2,894
  • 3
  • 14
  • 27
Mark
  • 41
  • 4