
I want to delete exact duplicate records from a table, keeping one. But I can't use the intermediate-table approach, because the duplicates are on all columns except the ID column. For example:

ID,
COL1,
COL2,
COL3,
COL4

The duplicates are on COL1, COL2, COL3, COL4.

Below are some samples:

ID  COL1 COL2  COL3 COL4
123 ABC  4RTFD  FGY  12346
234 ABC  4RTFD  FGY  12346
586 ABC  4RTFD  FGY  12346

Here only the ID column is different; the other four columns are duplicates. I want to keep only the row with the max ID.

What approach can I use here?

Thanks, Amit

user3901666
  • Add some sample table data and the expected result - as formatted text (not images.) – jarlh Sep 25 '17 at 07:06
  • It's not clear, but you can add more examples to clarify your question – peppe71-19 Sep 25 '17 at 07:16
  • Hi jarlh, Peppe, added some samples. – user3901666 Sep 25 '17 at 07:34
  • Possible duplicate of [Can we delete duplicate records from a table in teradata without using intermediate table](https://stackoverflow.com/questions/19537116/can-we-delete-duplicate-records-from-a-table-in-teradata-without-using-intermedi) – Andrew Sep 25 '17 at 16:17

4 Answers

2

Try joining the table to itself on all columns, where the ID is different...

    CREATE TABLE Dups
    (
        ID int IDENTITY(1,1) PRIMARY KEY,
        Col1 int NOT NULL,
        Col2 date NOT NULL,
        Col3 char(1) NOT NULL,
        Col4 char(1) NOT NULL
    );

    INSERT dbo.Dups (Col1, Col2, Col3, Col4)
    VALUES ('1','20170925','A','Z'), ('1','20170925','A','Z'), ('1','20170925','A','Z'),
           ('2','20170925','A','Z'), ('2','20170925','A','Z'), ('2','20170925','A','Z'),
           ('3','20170925','A','Z');

    SELECT * FROM Dups;

    -- This solution retains the first ID found that is duplicated...
    DELETE FROM Dups
    WHERE ID IN (
            SELECT ID
            FROM (
                    SELECT d1.ID,
                           row_number() OVER (ORDER BY d1.ID) AS DupSeq
                    FROM dbo.Dups AS d1
                    INNER JOIN dbo.Dups AS d2 ON  d2.Col1 = d1.Col1
                                              AND d2.Col2 = d1.Col2
                                              AND d2.Col3 = d1.Col3
                                              AND d2.Col4 = d1.Col4
                    WHERE d1.ID <> d2.ID
                 ) AS t
            WHERE DupSeq > 1
        );

    -- This solution retains the last ID found that is duplicated...
    DELETE FROM Dups
    WHERE ID NOT IN (
            SELECT DISTINCT
                   max(t.ID) OVER (PARTITION BY t.Col1, t.Col2, t.Col3, t.Col4 ORDER BY WindowOrder) AS KeepID
            FROM (
                    SELECT d1.ID,
                           d1.Col1,
                           d1.Col2,
                           d1.Col3,
                           d1.Col4,
                           1 AS WindowOrder
                    FROM dbo.Dups AS d1
                    LEFT OUTER JOIN dbo.Dups AS d2 ON  d2.Col1 = d1.Col1
                                                   AND d2.Col2 = d1.Col2
                                                   AND d2.Col3 = d1.Col3
                                                   AND d2.Col4 = d1.Col4
                                                   AND d1.ID <> d2.ID
                 ) AS t
        );

    SELECT * FROM Dups;

    DROP TABLE dbo.Dups;

You'll need the row_number() in the first solution because ID 1 will clearly match ID 3, and ID 3 will then match ID 1 as well.

In the second solution, the join is LEFT OUTER to retain those values that are not duplicated.
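As a quick way to sanity-check the keep-the-max-ID behaviour outside of SQL Server or Teradata, here is a minimal sketch using SQLite through Python's `sqlite3` module. The table and sample rows mirror the question; a simpler `GROUP BY`/`NOT IN` form is used here purely to verify the end result, not as a drop-in replacement for the queries above.

```python
import sqlite3

# In-memory SQLite table mirroring the question's sample data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Dups (ID INTEGER PRIMARY KEY, "
    "Col1 TEXT, Col2 TEXT, Col3 TEXT, Col4 TEXT)"
)
cur.executemany(
    "INSERT INTO Dups VALUES (?, ?, ?, ?, ?)",
    [(123, "ABC", "4RTFD", "FGY", "12346"),
     (234, "ABC", "4RTFD", "FGY", "12346"),
     (586, "ABC", "4RTFD", "FGY", "12346")],
)

# Keep only the max ID of every (Col1..Col4) group; delete the rest.
cur.execute("""
    DELETE FROM Dups
    WHERE ID NOT IN (
        SELECT MAX(ID) FROM Dups GROUP BY Col1, Col2, Col3, Col4
    )
""")
conn.commit()

print(cur.execute("SELECT ID FROM Dups ORDER BY ID").fetchall())  # [(586,)]
```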

SQLBadPanda
  • Thanks Nick, my requirement is slightly different. Added a few samples. – user3901666 Sep 25 '17 at 07:34
  • Ah, I see you want to retain the last duplicate and remove the rest....I'll update my answer with a revised solution. – SQLBadPanda Sep 25 '17 at 07:59
  • Okay..Thanks a lot Nick !! – user3901666 Sep 25 '17 at 08:33
  • Hi Nick, getting an error like "ordered aggregates and ordered analytical functions not allowed in subqueries"!! :( – user3901666 Sep 25 '17 at 11:16
  • Might be something to do with Teradata; it works in SQL Server. Instead of the subquery, insert its results into a temporary table and use that in the sub-query of the delete, e.g. `WHERE ID NOT IN (SELECT ID FROM @temptable)` – SQLBadPanda Sep 25 '17 at 11:22
0

You can do what many others have done before in SQL Server (and Teradata), see How to delete duplicate rows in sql server?, or you can do it even without a CTE, like:

DELETE FROM (
  SELECT ROW_NUMBER()
  OVER (PARTITION BY col1,col2,col3,col4
        ORDER BY ID DESC) rn
  FROM tbl  -- tbl is "your" table ...
) t1 WHERE rn>1

It works on SQL Server; I haven't tested it on Teradata, but since ROW_NUMBER() exists there too, I expect it to work ...
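Some dialects may not allow deleting from a derived table directly, but the same ROW_NUMBER() idea can be rewritten as `DELETE ... WHERE ID IN (...)`. Below is a small illustrative sketch using SQLite (3.25+ for window functions) through Python's `sqlite3` module, with table and column names taken from the question:

```python
import sqlite3

# In-memory SQLite table with the question's sample data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE tbl (ID INTEGER PRIMARY KEY, "
    "col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT)"
)
cur.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [(123, "ABC", "4RTFD", "FGY", "12346"),
     (234, "ABC", "4RTFD", "FGY", "12346"),
     (586, "ABC", "4RTFD", "FGY", "12346")],
)

# rn = 1 marks the max-ID row of each duplicate group (ORDER BY ID DESC);
# every row with rn > 1 is a duplicate to delete.
cur.execute("""
    DELETE FROM tbl
    WHERE ID IN (
        SELECT ID FROM (
            SELECT ID,
                   ROW_NUMBER() OVER (PARTITION BY col1, col2, col3, col4
                                      ORDER BY ID DESC) AS rn
            FROM tbl
        )
        WHERE rn > 1
    )
""")
conn.commit()

print(cur.execute("SELECT ID FROM tbl").fetchall())  # [(586,)]
```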

Carsten Massmann
0

You can use a correlated subquery and the max function to achieve your desired result, as below.

DELETE
FROM table1 t1
WHERE t1.Id <> (
        SELECT max(t2.Id)
        FROM table1 t2
        WHERE t1.col1 = t2.col1
            AND t1.col2 = t2.col2
            AND t1.col3 = t2.col3
            AND t1.col4 = t2.col4
        );

The above query assumes table1 is your table name.

select * from table1;

Result:

ID   Col1   Col2    Col3   Col4
---------------------------------
586  ABC    4RTFD   FGY    12346

You can check the demo *here
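For reference, the correlated-subquery delete can also be reproduced almost verbatim in SQLite through Python's `sqlite3` module (SQLite, like PostgreSQL, supports correlated subqueries in DELETE; the sample data is the question's). This is just an illustrative sketch, not the Teradata or PostgreSQL session from the demo:

```python
import sqlite3

# In-memory SQLite table named as in the answer, with the question's rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE table1 (Id INTEGER PRIMARY KEY, "
    "col1 TEXT, col2 TEXT, col3 TEXT, col4 TEXT)"
)
cur.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [(123, "ABC", "4RTFD", "FGY", "12346"),
     (234, "ABC", "4RTFD", "FGY", "12346"),
     (586, "ABC", "4RTFD", "FGY", "12346")],
)

# For each row, the correlated subquery finds the max Id of its duplicate
# group; rows whose Id is not that max are deleted. The outer row is
# referenced by the table name "table1", since the subquery aliases its
# own copy as t2.
cur.execute("""
    DELETE FROM table1
    WHERE Id <> (
        SELECT MAX(t2.Id)
        FROM table1 t2
        WHERE t2.col1 = table1.col1
          AND t2.col2 = table1.col2
          AND t2.col3 = table1.col3
          AND t2.col4 = table1.col4
    )
""")
conn.commit()

print(cur.execute("SELECT Id FROM table1").fetchall())  # [(586,)]
```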

Update:

The rows below are added to the sample data set.

id  col1    col2    col3    col4
----------------------------------
345 XYZ    4FTFD    FGY     12346
745 XYZ    4FTFD    FGY     12346
945 XYZ    4FTFD    FGY     12346

Result:

id   col1    col2   col3    col4
-----------------------------------
586  ABC    4RTFD   FGY     12346
945  XYZ    4FTFD   FGY     12346

DEMO

*Note: Due to the unavailability of a Teradata online demo tool, a PostgreSQL demo has been used, as correlated subqueries are supported by PostgreSQL. The query has also been simulated on a local Teradata environment.

zarruq
  • Thanks Zarruq, let me try this!! – user3901666 Sep 25 '17 at 11:17
  • Hi @Zarruq, this is not working, as I want max(id) based on a certain partition, as mentioned by Nick, and this is giving me an error in Teradata – user3901666 Sep 25 '17 at 11:42
  • @user3901666: What partitions? I have added another set of sample records to the demo mentioned in the question and it's giving the same result as "Nick's" query. Also, what error are you getting in Teradata, as I have successfully simulated it on a local TD DB instance? PFA [screenshot](http://i64.tinypic.com/vnhu8w.png) – zarruq Sep 25 '17 at 14:26
  • @dnoeth : Are you there? Kindly comment :-) – zarruq Sep 25 '17 at 14:27
  • Hi Zarruq, I need to pick only the max ID based on the partitions of all the columns, and excluding that max ID I need to delete all the records. Through your code I am getting only the max ID – user3901666 Sep 26 '17 at 08:33
  • That's exactly what my code is doing :-). If you have one or more sets of data, it will delete all rows from each set except the `max(id)`, as shown in the 2nd demo. If this isn't what you are expecting, kindly include some more sample data and expected output in your question. Thanks! – zarruq Sep 26 '17 at 08:49
0

Isn't that a simple usage of the grouping functionality?

select max(ID) ID, COL1, COL2, COL3, COL4
from tableA
group by 2,3,4,5

and save it into a new table. If it's necessary to delete the duplicate rows from the existing table, you can perform the following delete statement:

delete from tableA as a1
    where (
        select 1 from (
            select max(ID) ID, COL1, COL2, COL3, COL4 from tableA group by 2,3,4,5) a2
        where a1.ID = a2.ID
            and a1.COL1 = a2.COL1
            and a1.COL2 = a2.COL2
            and a1.COL3 = a2.COL3
            and a1.COL4 = a2.COL4
         ) is null
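The save-into-a-new-table route can be sketched with SQLite via Python's `sqlite3` (an illustration only; `tableA_dedup` is a made-up name, and COL4 is included in the grouping because the question treats all four non-ID columns as the duplicate key):

```python
import sqlite3

# In-memory SQLite table with the question's sample data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE tableA (ID INTEGER PRIMARY KEY, "
    "COL1 TEXT, COL2 TEXT, COL3 TEXT, COL4 TEXT)"
)
cur.executemany(
    "INSERT INTO tableA VALUES (?, ?, ?, ?, ?)",
    [(123, "ABC", "4RTFD", "FGY", "12346"),
     (234, "ABC", "4RTFD", "FGY", "12346"),
     (586, "ABC", "4RTFD", "FGY", "12346")],
)

# GROUP BY the duplicate-defining columns, keeping MAX(ID) per group,
# and materialise the result into a new table.
cur.execute("""
    CREATE TABLE tableA_dedup AS
    SELECT MAX(ID) AS ID, COL1, COL2, COL3, COL4
    FROM tableA
    GROUP BY COL1, COL2, COL3, COL4
""")
conn.commit()

print(cur.execute("SELECT * FROM tableA_dedup").fetchall())
# [(586, 'ABC', '4RTFD', 'FGY', '12346')]
```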
MarkusN
  • This will give an error like "ordered aggregates and ordered analytical functions not allowed in subqueries". I have achieved it using a correlated query (EXISTS clause) – user3901666 Sep 26 '17 at 13:17
  • @user3901666 That's true, ordered analytical functions aren't allowed, but this solution only uses grouping, and that is allowed. I tested this code on my system (Teradata 14.10.07.17) and it works as expected. – MarkusN Sep 26 '17 at 14:04
  • Okay..Thanks @Markusn let me check once..will get back to you.:) – user3901666 Sep 26 '17 at 16:21