MSSQL: procedure to remove duplicates

Question

Consider the table which does not have any primary or foreign keys. I would like to write procedure which will remove all the duplicate rows given the table name.

The row should be considered duplicate of other if all of the fields are the same.

Can you suggest me if this is possible. One thing I tried is to group by every field but this approach is not universal.

Possible duplicate of [Delete duplicate records from a SQL table without a primary key](http://stackoverflow.com/questions/985384/delete-duplicate-records-from-a-sql-table-without-a-primary-key) — Tab Alleman, Apr 15 '16 at 14:03

Lukasz Szozda · Accepted Answer · 2016-04-15T12:25:29.523

You could achieve it using Dynamic-SQL

Quick backed solution (great room for improvements):

CREATE TABLE tab1(a INT, b INT);
INSERT INTO tab1(a,b) VALUES (1,1),(1,1),(1,1),(2,3);
GO

Procedure:

CREATE PROCEDURE dbo.remove_duplicates
    @tab_name SYSNAME
    ,@debug BIT = 0
AS
BEGIN
    SET NOCOUNT ON;
    -- TODO: validation if table does not exist, raise error
    -- TODO: Add @schema parameter
    -- TODO: Wrap with BEGIN TRY, omit calculated columns, CAST `TEXT/IMAGE/BINARY`....

    DECLARE @sql NVARCHAR(MAX) = 
       'WITH cte AS
        (
            SELECT *, rn = ROW_NUMBER() OVER(PARTITION BY <cols> ORDER BY (SELECT 1))
            FROM <tab_placeholder>
        )
        DELETE FROM cte
        WHERE rn <> 1;';

    DECLARE @cols NVARCHAR(MAX) = STUFF((SELECT ',' +  column_name
                                         FROM INFORMATION_SCHEMA.COLUMNS
                                         WHERE TABLE_NAME = @tab_name
                                           AND TABLE_SCHEMA = 'dbo'
                                         FOR XML PATH('')), 1, 1, '');    

    SET @sql = REPLACE(@sql, '<tab_placeholder>', QUOTENAME(@tab_name));
    SET @sql = REPLACE(@sql, '<cols>', @cols);

    IF @debug = 1 SELECT @sql;

    EXEC dbo.sp_executesql @sql;    

END
GO

Execution:

EXEC [dbo].[remove_duplicates] @tab_name = 'tab1', @debug = 1;
SELECT * FROM tab1;

LiveDemo

Thanks, actualy I was looking for general approach not to rely on table columns. But in the end even here you are getting them and then partion. But this solves my problem so accepted. — seeker, Apr 15 '16 at 12:31
The case you have to rely on table columns (one way or another) :) There is no such thing as `SELECT * FROM tab GROUP BY *` :) — Lukasz Szozda, Apr 15 '16 at 12:34

score 0 · Answer 2 · answered Apr 15 '16 at 12:20

This will remove duplicates from a table. Your partition by must contain the fields that you wish to group by to determine what a duplicate is. In your case, all of them.

IF OBJECT_ID('tempdb..#TABLE') IS NOT NULL DROP TABLE #TABLE

CREATE TABLE #TABLE ( SOMEINT INT,SOMEVALUE VARCHAR(255) )
INSERT INTO #TABLE ( SOMEINT, SOMEVALUE )
VALUES (1,'VALUE1')
,(1,'VALUE2')
,(1,'VALUE2')
,(1,'VALUE3')
,(1,'VALUE4')
,(1,'VALUE4')
,(1,'VALUE4')
,(1,'VALUE4')
,(1,'VALUE5')
,(1,'VALUE6')
,(1,'VALUE6')
,(1,'VALUE6')
,(1,'VALUE7')
,(1,'VALUE8')
,(1,'VALUE8')
,(1,'VALUE9')
,(1,'VALUE10')

;WITH dedup
AS
( 
SELECT *, ROW_NUMBER() OVER (PARTITION BY SOMEINT,SOMEVALUE ORDER BY SOMEINT ASC) AS SEQUENCE
FROM #TABLE
)
DELETE
FROM dedup
WHERE SEQUENCE > 1
GO

SELECT * FROM #TABLE

Abdul Rasheed · Answer 3 · 2016-04-15T12:27:42.467

There are number of ways

First,

Create a temp table, and copy distinct data to that temp table. Delete or truncate the data from your actual table. And copy the temp table to your actual table. Drop the temp table

SELECT DISTINCT * INTO #table1 from TABLE1
DELETE FROM TABLE1
INSERT INTO TABLE1
SELECT * FROM #table1
DROP TABLE #table1

or

Second,

Add one column to the table, update that column using ROW_NUMBER PARTITION, then remove the rows where COLUMN <> 1 . Drop the newly created column.

MSSQL: procedure to remove duplicates

3 Answers3