How do I find duplicates across multiple columns?

Question

So I want to do something like this sql code below:

select s.id, s.name,s.city 
from stuff s
group by s.name having count(where city and name are identical) > 1

To produce the following, (but ignore where only name or only city match, it has to be on both columns):

id      name  city   
904834  jim   London  
904835  jim   London  
90145   Fred  Paris   
90132   Fred  Paris
90133   Fred  Paris

score 179 · Accepted Answer · answered Nov 16 '11 at 09:26

179

Duplicated id for pairs name and city:

select s.id, t.* 
from [stuff] s
join (
    select name, city, count(*) as qty
    from [stuff]
    group by name, city
    having count(*) > 1
) t on s.name = t.name and s.city = t.city

answered Nov 16 '11 at 09:26

Michał Powaga

22,561
8
51
62

Note that if either ```name``` or ```city``` contain ```null```, then they will fail to be reported in the outer query, but will be matched in the inner query. – Adam Parkin Jan 13 '15 at 23:52
5

If the values can possibly contain ```null``` then (unless I'm missing something) you need to change it to a ```CROSS JOIN``` (full Cartesian product) and then add a ```WHERE``` clause such as: ```WHERE ((s.name = t.name) OR (s.name is null and t.name is null)) AND ((s.city = t.city) OR (s.city is null and t.city is null))``` – Adam Parkin Jan 14 '15 at 16:03
This answer will not return unique IDs of each duplicated record. Instead, it will merge the duplicated records, into a single record, and choose whichever ID appears first in the table. I believe the answer by @ssarabando is a more appropriate answer. – Crayons Aug 17 '21 at 20:45
1

@Crayons Both answers will return the same results. See https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=8bf900c22bbb263ef792f73e8355cdc2 and https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=8e94833cb4dd27a1974a1a362ff2a11c – Jack B Dec 08 '21 at 14:59

score 114 · Answer 2 · edited May 07 '15 at 19:04

114

 SELECT name, city, count(*) as qty 
 FROM stuff 
 GROUP BY name, city HAVING count(*)> 1

edited May 07 '15 at 19:04

Plamen G

4,729
4
33
44

answered May 07 '15 at 18:59

Sunnny

1,173
1
7
3

2

With that, you can't know the id of each line. – Juan.Queiroz Sep 28 '20 at 14:42
Replace `SELECT name, city, count(*) as qty ` with `SELECT *` to see all columns, including id. – yoyo May 19 '21 at 15:45
2

@yoyo Your suggestion gives an error; please suggest the entire SQL which does not give an error. – nutty about natty Aug 16 '21 at 18:28
1

To also see (at least) the min/max values of ID: `SELECT max(id), min(id), name, city, count(*) as qty FROM stuff GROUP BY name, city HAVING count(*)> 1` – nutty about natty Aug 16 '21 at 18:30
1

This answer doesn't address the question, in that it does not return the unique ID. – Crayons Aug 17 '21 at 20:42

score 26 · Answer 3 · answered Nov 16 '11 at 09:25

26

Something like this will do the trick. Don't know about performance, so do make some tests.

select
  id, name, city
from
  [stuff] s
where
1 < (select count(*) from [stuff] i where i.city = s.city and i.name = s.name)

answered Nov 16 '11 at 09:25

ssarabando

3,397
2
36
42

1

This is an under-rated answer and I believe it's the best one here. This answer identifies duplicates, while returning individual records and their unique ID's. The marked answer groups the results, meaning you cannot actually identify the duplicates by their unique ID's, and is therefore a less useful dataset. – Crayons Aug 17 '21 at 20:48
1

@Crayons Both answers will return the same results. See https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=8bf900c22bbb263ef792f73e8355cdc2 and https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=8e94833cb4dd27a1974a1a362ff2a11c – Jack B Dec 08 '21 at 15:00

score 9 · Answer 4 · edited Jun 20 '20 at 09:12

Using count(*) over(partition by...) provides a simple and efficient means to locate unwanted repetition, whilst also list all affected rows and all wanted columns:

SELECT
    t.*
FROM (
    SELECT
        s.*
      , COUNT(*) OVER (PARTITION BY s.name, s.city) AS qty
    FROM stuff s
    ) t
WHERE t.qty > 1
ORDER BY t.name, t.city

While most recent RDBMS versions support count(*) over(partition by...) MySQL V 8.0 introduced "window functions", as seen below (in MySQL 8.0)

CREATE TABLE stuff(
   id   INTEGER  NOT NULL
  ,name VARCHAR(60) NOT NULL
  ,city VARCHAR(60) NOT NULL
);

INSERT INTO stuff(id,name,city) VALUES 
  (904834,'jim','London')
, (904835,'jim','London')
, (90145,'Fred','Paris')
, (90132,'Fred','Paris')
, (90133,'Fred','Paris')

, (923457,'Barney','New York') # not expected in result
;

SELECT
    t.*
FROM (
    SELECT
        s.*
      , COUNT(*) OVER (PARTITION BY s.name, s.city) AS qty
    FROM stuff s
    ) t
WHERE t.qty > 1
ORDER BY t.name, t.city

    id | name | city   | qty
-----: | :--- | :----- | --:
 90145 | Fred | Paris  |   3
 90132 | Fred | Paris  |   3
 90133 | Fred | Paris  |   3
904834 | jim  | London |   2
904835 | jim  | London |   2

db<>fiddle here

Window functions. MySQL now supports window functions that, for each row from a query, perform a calculation using rows related to that row. These include functions such as RANK(), LAG(), and NTILE(). In addition, several existing aggregate functions now can be used as window functions; for example, SUM() and AVG(). For more information, see Section 12.21, “Window Functions”.

score 7 · Answer 5 · edited Feb 21 '23 at 22:45

7

I found this way to be pretty flexible / efficient

select 
    s1.id
    ,s1.name
    ,s1.city 
from 
    stuff s1
    ,stuff s2
Where
    s1.id <> s2.id
    and s1.name = s2.name
    and s1.city = s2.city

edited Feb 21 '23 at 22:45

TylerH

20,799
66
75
101

answered Apr 09 '19 at 17:53

MattD

71
1
1

select **distinct** ... might be what's needed/missing here, no? – nutty about natty Aug 16 '21 at 18:20

score 5 · Answer 6 · edited Feb 21 '23 at 22:46

5

SELECT Feild1, Feild2, COUNT(*)
FROM table name
GROUP BY Feild1, Feild2
HAVING COUNT(*)>1

This will give you all your answers.

edited Feb 21 '23 at 22:46

TylerH

20,799
66
75
101

answered Apr 30 '21 at 09:31

Arunav dutta gupta

79
1
5

score 2 · Answer 7 · answered Nov 16 '11 at 09:22

2

You have to self join stuff and match name and city. Then group by count.

select 
   s.id, s.name, s.city 
from stuff s join stuff p ON (
   s.name = p.city OR s.city = p.name
)
group by s.name having count(s.name) > 1

answered Nov 16 '11 at 09:22

Anja

324
1
2

2

Fails in SQL Server: all non-aggregate columns must be in the GROUP BY – gbn Nov 16 '11 at 09:30

D-Shih · Answer 8 · 2022-05-11T07:28:40.073

From OP question, OP wants to group columns and get additional columns that aren't grouping columns.

so that regular group by + having might not be worked.

I would use EXISTS subquery with HAVING.

we can try to add columns which you want to mark duplicate in a subquery.

SELECT s.id, s.name,s.city 
FROM stuff s
WHERE EXISTS (
   SELECT 1
   FROM stuff ss
   WHERE 
      s.name = ss.name
   AND
      s.city = ss.city
   GROUP BY ss.name,ss.city
   HAVING COUNT(*) > 1
)

If we create a suitable Index might get better performance than join

CREATE INDEX IX_name ON stuff (
    name,
    city
);

Another way we can use COUNT window function with filter condition to make it which add grouping columns in PARTITION BY part

SELECT s.id, s.name,s.city 
FROM (
   SELECT *,COUNT(*) OVER(PARTITION BY name,city) cnt
   FROM stuff 
) s
WHERE cnt > 1

sqlfiddle

score 1 · Answer 9 · answered Aug 26 '22 at 19:13

It's a pleasure to add another way of achieving the required output using Cross Apply here like below :

select s.* from stuff s
cross apply(
    select name, city from stuff
    group by name, city
    having Count(*) > 1) x
where s.name = x.name and s.city=x.city

score -1 · Answer 10 · edited Nov 14 '17 at 15:18

Given a staging table with 70 columns and only 4 representing duplicates, this code will return the offending columns:

SELECT 
    COUNT(*)
    ,LTRIM(RTRIM(S.TransactionDate)) 
    ,LTRIM(RTRIM(S.TransactionTime))
    ,LTRIM(RTRIM(S.TransactionTicketNumber)) 
    ,LTRIM(RTRIM(GrossCost)) 
FROM Staging.dbo.Stage S
GROUP BY 
    LTRIM(RTRIM(S.TransactionDate)) 
    ,LTRIM(RTRIM(S.TransactionTime))
    ,LTRIM(RTRIM(S.TransactionTicketNumber)) 
    ,LTRIM(RTRIM(GrossCost)) 
HAVING COUNT(*) > 1

.

How do I find duplicates across multiple columns?

10 Answers10

Linked

Related