sql: How to remove duplicate rows (content the same, but sequence is different)

Question

You see that the SKU1 has 2 rows, but actually the content of these 2 rows are the same, just the sequence of "b" and "c" makes difference.

What if I want to remove the duplicate rows as shown in the 2nd picture?

In Oracle there is a LEAST/GREATEST function that can realize it, but I used SQL Server, therefore it doesn't work following the instruction of the below post:

How to remove duplicate rows in SQL

see http://stackoverflow.com/questions/24322259/remove-duplicate-rows-in-sql-server-query — Squirrel, Jul 22 '16 at 05:19
I can't see any duplicate row in your original table, all three columns have different values — Sandip - Frontend Developer, Jul 22 '16 at 05:26
@SandipPatel, Hi you see SKU1, the 2nd and 3rd columns are the same, just the sequence is different — Héléna, Jul 22 '16 at 05:33
Possible duplicate of [How can I remove duplicate rows?](http://stackoverflow.com/questions/18932/how-can-i-remove-duplicate-rows) — neer, Jul 22 '16 at 05:36
When you have values in different column, then it always consider as unique row — Sandip - Frontend Developer, Jul 22 '16 at 05:38
if you want to remove all duplicates, you could do something funky like this: `DELETE T FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY mT.SKUCol, Z.A, Z.B ORDER BY (SELECT NULL)) RN FROM myTable mT CROSS APPLY (SELECT MIN(X), MAX(X) FROM (VALUES (col1), (col2)) Y(X)) Z(A, B)) T WHERE RN > 1;` — ZLK, Jul 22 '16 at 05:45
Hi I am finding the same result which Héléna is looking for using the query below — Rohit Gupta, Jul 22 '16 at 06:11
I think the question is just about filtering duplicate rows from the results, it is very simple to do that: I've described it [here](http://stackoverflow.com/a/38571586/1016343). The other question mentioned, how to remove duplicates is about physically removing them from the database table, so for me this seems to be no duplicate question. — Matt, Aug 12 '16 at 07:56

score 1 · Answer 1 · answered Jul 22 '16 at 05:46

Please Use Max() and Min Function instead of least and greatest of oracle if used the follwoing steps and got the same result.

Create Table Transactions (Name varchar(255),Quantity1 int,Quantity2 int)
   Insert Into Transactions values
   ('SKU1',10,20),
      ('SKU1',20,10),
   ('SKU2',10,20),
   ('SKU2',10,20)

Now I used the query below to find the solution of your answer

    Select T1.Name,MAX(T1.Quantity1),MIN(T2.Quantity2) From Transactions   T1
    join Transactions T2
    on T1.Name=T2.Name
    group by T1.Name

Please Reply

`max()` and `min()` do something completely different then `greatest()` and `least()` — , Jul 22 '16 at 10:44

LukStorms · Accepted Answer · 2016-08-12T08:00:42.977

If it's only 2 columns where the order should not matter for the group by?

Then you could use IIF (or a CASE WHEN) to calculate the maximum and minimum values.
And use those calculated values in the GROUP BY.

For example:

select Name, 
MAX(Val1) as Val1,
MIN(Val2) as Val2
from Table1
GROUP BY Name,
IIF(Val2 is null or Val1 < Val2, Val1, Val2),
IIF(Val1 is null or Val1 < Val2, Val2, Val1);

For the example records that would give the result:

Name Val1 Val2
SKU1 20   10
SKU2 20   10

Or if you want to use a fancy XML trick :

select Name, max(Val1) as Val1, min(Val2) as Val2
from (
  select *, 
  cast(
    convert(XML, 
      concat('<n>',Val1,'</n><n>',Val2,'</n>')
    ).query('for $n in /n order by $n return string($n)'
  ) as varchar(6)) as SortedValues
  from Table1
) q
group by Name, SortedValues;

The last method could be more usefull when there are more columns involved.

To actually remove the duplicates?
Here's an example that uses a table variable to demonstrate:

declare @Table1 TABLE (Id int, Name varchar(20), Val1 int, Val2 int);

Insert Into @Table1 values
(1,'SKU1',10,20),
(2,'SKU1',20,10),
(3,'SKU1',12,15),
(4,'SKU2',10,null),
(5,'SKU2',null,10),
(6,'SKU2',10,20);

delete from @Table1
where Id in (
    select Id
    from (
    select Id, 
    row_number() over (partition by Name, 
         IIF(Val2 is null or Val1 < Val2, Val1, Val2),
         IIF(Val1 is null or Val1 < Val2, Val2, Val1)
       order by Val1 desc, Val2 desc
    ) as rn
    from @Table1
    ) q
    where rn > 1
);

select * from @Table1;

score 1 · Answer 3 · answered Jul 22 '16 at 10:42

1

greatest() can be simulated using a CASE expression

greatest(b,c) is the same as:

case 
   when b > c then b
   else c
end

You can use this together with a distinct to remove your duplicates:

select distinct 
          a, 
          case when b > c then b else c end as x
from the_table
order by a;

answered Jul 22 '16 at 10:42

,this method is simple and works, I can use this result to left join with the original table to get the result,many thanks~ – Héléna Aug 12 '16 at 05:31

Mr Anderson · Answer 4 · 2016-07-22T13:25:47.580

1

Try %%physloc%%. It is the equivalent of oracle's RowId.

Find it

select *, %%physloc%% from [MyTable] where ...
Delete what you want

delete from [MyTable] where %%physloc%% = 0xDEADBEEF -- (your address)
Consider adding a Unique / Primary Key to prevent future occurrences.

edited Jul 22 '16 at 13:25

answered Jul 22 '16 at 13:16

Mr Anderson

2,200
13
23

score 1 · Answer 5 · edited Jul 25 '16 at 18:28

1

SELECT * FROM abc where A='SKU1'and B=20 || A='SKU2'and B=10 

 a      b    c  
SKU1    20  10
SKU2    10  20

edited Jul 25 '16 at 18:28

Nick

7,103
2
21
43

answered Jul 25 '16 at 14:17

anand

11
2

This syntax is invalid in T-SQL – Matt Jul 25 '16 at 15:22
`||` is a concatenation operator for some DBMSs, though not SQL Server and not to be used in this way. Perhaps you meant to use the `OR` operator? – Nick Jul 25 '16 at 18:27
yes ,i meant to use OR operator ,but this syntax is valid in Mysql. – anand Jul 26 '16 at 05:28

score 1 · Answer 6 · edited May 23 '17 at 10:29

From your question it is not clear whether you want to filter duplicates by row or duplicates by column. Let me describe both to make sure your question is addressed completely.

In Example 1, you can see that we have duplicate rows:

To filter them, just add the keyword DISTINCT to your query, as follows:

SELECT DISTINCT * FROM myTable;

It filters the duplicate rows and returns:

Hence, you don't need a least or greatest function in this case.

In Example 2, you can see that we have duplicates in the columns:

Here, SELECT DISTINCT * from abc will still return all 4 rows. If we regard only the first column in the filtering, it can be achieved by the following query:

select distinct t.Col1,
    (select top 1 Col2 from myTable ts where t.Col1=ts.Col1) Col2,
    (select top 1 Col3 from myTable ts where t.Col1=ts.Col1) Col3
from myTable t

It will pick the first matching value in each column, so the result of the query will be:

The difference between Example 1 and this example is that it eliminated just the duplicate occurances of the values in Col1 of myTable and then returned the related values of the other columns - hence the results in Col1 and Col2 are different.

Note:

In this case you can't just join the table myTable because then you would be forced to list the columns in the select distinct which would return more rows then you want to have. Unfortunately, T-SQL does not offer something like SELECT DISTINCT ON(fieldname) i.e. you can't directly specify a distinct (single) fieldname.
You might have thought "why not use GROUP BY?" The answer of that question is here: With GROUP BY you are forced to either specify all columns which is a technical DISTINCT equivalent, or you need to use aggregate functions like MIN or MAX which aren't returning what you want either.

A more advanced query (you might have seen is once before!) which has the same result is:

SELECT Col1, Col2, Col3 
FROM    (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber 
        FROM myTable
        ) t
WHERE RowNumber=1

This statement numbers each occurance of a value on Col1 in the subquery and then takes the first of each duplicate rows - which effectively is a grouping by Col1 (but without the disadvantages of GROUP BY).

N.B. In the examples above, I am assuming a table definition like:

CREATE TABLE [dbo].[myTable](
    [Col1] [nvarchar](max) NULL,
    [Col2] [int] NULL,
    [Col3] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]

For the examples above, we don't need to declare a primary key column. But generally spoken you'll need a primary key in database tables, to be able to reference rows efficiently.

If you want to permanently delete rows not needed, you should introduce a primary key, because then you can delete the rows not displayed easily as follows (i.e. it is the inverse filter of the advanced query mentioned above):

DELETE FROM [dbo].[myTable]
WHERE  myPK NOT IN
    (SELECT myPK  
        FROM    (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col1) AS RowNumber 
            FROM [dbo].[myTable]
        ) t
    WHERE RowNumber=1 and myPK=t.myPK)

This assumes you have added an integer primary key myPK which auto-increments (you can do that via the SQL Management Studio easily by using the designer).

Or you can execute the following query to add it to the existing table:

BEGIN TRANSACTION
GO
ALTER TABLE dbo.myTable ADD
    myPK int NOT NULL IDENTITY (1, 1)
GO
ALTER TABLE dbo.myTable ADD CONSTRAINT
    PK_myTable PRIMARY KEY CLUSTERED (myPK) 
    WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
            ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
    ON [PRIMARY]

GO
ALTER TABLE dbo.myTable SET (LOCK_ESCALATION = TABLE)
GO
COMMIT

You can find some more examples here at MSDN.

score 1 · Answer 7 · answered Jul 30 '16 at 06:52

This may seem a little complicated at first, but we can also use PIVOT/UNPIVOT to obtain the results

Below is the query

select * 
from 
    (
    select 
        *, 
        'quantity'+ cast(row_number() over (partition by name order by data) as nvarchar) cols  
    from
        (
            select 
                distinct name, data 
            from 
                (select * from transactions)s
            unpivot
            (
                data for cols in (quantity1,quantity2)
            )u
        )s
    )s
pivot
(
    max(data) for cols in (quantity1,quantity2)
)p

sql: How to remove duplicate rows (content the same, but sequence is different)

7 Answers7