How to find duplicate entries from a SQL Server table based on some column value?

Question

Here is my SQL Server table - I need to find duplicate customer code based on Customer Name, Date of birth, Father Name columns:

Customer code   Customer Name   Date of birth   Father Name
-------------------------------------------------------------
0001            Md. Alam        1991-10-20      Sr. Alam 
0002            Alam            1991-10-20      Sr. alam
0004            Hasan           1990-01-01      Sr. Hasan
0005            Karim           1988-01-01      Sr. Karim
0006            Karim           1988-01-01      S Karim
0007            Kalam           1985-01-01      Sr. Kalam

Output looks lik:

0001,0002,0005,0006 customer are duplicates because similarities of Customer Name, Date of birth and Father Name columns.

Customer code   Customer Name   Date of birth   Father Name
------------------------------------------------------------
0001            Md. Alam        1991-10-20      Sr. Alam 
0002            Alam            1991-10-20      Sr. alam
0005            Karim           1988-01-01      Sr. Karim
0006            Karim           1988-01-01      S Karim

Please help me to find out an effective way

While asking a question, you need to provide a **minimal reproducible example**. Please refer to the following link: https://stackoverflow.com/help/minimal-reproducible-example Please provide the following: (1) DDL and sample data population, i.e. CREATE table(s) plus INSERT T-SQL statements. (2) What you need to do, i.e. logic and your code attempt implementation of it in T-SQL. (3) Desired output, based on the sample data in #1 above. (4) Your SQL Server version (SELECT @@version;) — Yitzhak Khabinsky, May 30 '21 at 15:42

prem · Answer 1 · 2021-05-30T16:06:16.417

1

select * from customer where code in(
select c1.code from customer c1 inner join customer c2 on c1.dob = c2.dob and 
(c1.father = c2.father or c1.name = c2.name)
group by c1.code having count(c1.code) > 1

edited May 30 '21 at 16:06

answered May 30 '21 at 15:57

prem

223
3
13

0001 Md. Alam 1991-10-20 Sr. Alam 0002 Alam 1991-10-20 Sr. alam Here two duplicate row name(Not same), DOB (Same), Father's name (Same) So, i think c1.father = c2.father or c1.name = c2.name is not working here – Ashraful20 May 30 '21 at 16:31

score 0 · Answer 2 · edited May 31 '21 at 02:09

0

SELECT t.Customer_Code
    ,t.Customer_Name
FROM tb2 AS t
WHERE t.Customer_Name IN (
        SELECT t.Customer_Name
        FROM tb2 AS t
        GROUP BY t.Customer_Name
        HAVING COUNT(t.Customer_Name) > 1
        )
ORDER BY t.Customer_Code ASC

edited May 31 '21 at 02:09

Amit11794

148
1
2
15

answered May 30 '21 at 16:19

Anjum Raza

1
2

always try to add formatted query and use option "{ } - Add Sample Code", while adding query or code. – Amit11794 May 30 '21 at 16:32

Yitzhak Khabinsky · Accepted Answer · 2021-05-30T17:23:51.360

Please try the following solution.

1st CTE is cleansing two columns CustomerName and FatherName.
2nd CTE is creating buckets based on the combination of 3 columns: CustomerName, DOB, and FatherName.
3rd CTE is creating a counter of rows in a bucket, i.e. finds duplicates.
Final SELECT is joining back to the original table and filtering out not duplicate rows.

SQL

-- DDL and sample data population, start
DECLARE @tbl TABLE (CustomerCode CHAR(4), CustomerName VARCHAR(20), DOB DATE, FatherName VARCHAR(20));
INSERT INTO @tbl (CustomerCode, CustomerName, DOB, FatherName) VALUES
('0001', 'Md. Alam', '1991-10-20', 'Sr. Alam'),
('0002', 'Alam', '1991-10-20', 'Sr. alam'),
('0004', 'Hasan', '1990-01-01', 'Sr. Hasan'),
('0005', 'Karim', '1988-01-01', 'Sr. Karim'),
('0006', 'Karim', '1988-01-01', 'S Karim'),
('0007', 'Kalam', '1985-01-01', 'Sr. Kalam');
-- DDL and sample data population, end

WITH rs AS
(
    SELECT CustomerCode, DOB
        , RIGHT(CustomerName, LEN(CustomerName) - c.pos) AS CustomerName
        , RIGHT(FatherName, LEN(FatherName) - f.pos) AS FatherName
    FROM @tbl
        CROSS APPLY (SELECT CHARINDEX(SPACE(1), CustomerName)) AS c(pos)
        CROSS APPLY (SELECT CHARINDEX(SPACE(1), FatherName)) AS f(pos)
), cte AS
(
    SELECT * 
    , ROW_NUMBER() OVER (ORDER BY CustomerCode) -
     ROW_NUMBER() OVER (PARTITION BY rs.CustomerName, rs.DOB, rs.FatherName ORDER BY rs.CustomerCode) AS bucket
    FROM rs
), cte2 AS
(
    SELECT CustomerCode, bucket, COUNT(bucket) OVER (PARTITION BY cte.bucket) AS [counter]
    FROM cte
    GROUP BY  CustomerCode, bucket
)
SELECT t.* FROM @tbl AS t
INNER JOIN cte2 ON cte2.CustomerCode = t.CustomerCode
WHERE cte2.counter > 1;

Output

+--------------+--------------+------------+------------+
| CustomerCode | CustomerName |    DOB     | FatherName |
+--------------+--------------+------------+------------+
|         0001 | Md. Alam     | 1991-10-20 | Sr. Alam   |
|         0002 | Alam         | 1991-10-20 | Sr. alam   |
|         0005 | Karim        | 1988-01-01 | Sr. Karim  |
|         0006 | Karim        | 1988-01-01 | S Karim    |
+--------------+--------------+------------+------------+

Thank you so much brother.It's working. It's so much helpful for me. — Ashraful20, May 31 '21 at 15:40

score 0 · Answer 4 · answered May 31 '21 at 02:27

If I am understanding it correctly, you are considering customer 0001 and 0002 as duplicates when customer name clearly not matching as a whole.

I believe you need some kind of fuzzy search here, which can match 2 strings and gives a percentage of match based on what you can decide if 2 entries are duplicate or not.

Similar questions answered here. SQL Server Fuzzy Search with Percentage of match

How to find duplicate entries from a SQL Server table based on some column value?

4 Answers4