How to find duplicate emails due to case sensitivity

Question

I am trying to find the IDs that have duplicate emails due to case sensitivity.

For below table

ID      Email
101     example@gmail.com
101     EXAMPLE@gmail.com
102     email@gmail.com
102     email@gmail.com
103     la@gmail.com
103     sf@yahoo.com

The output should give below

ID      Email 
101     example@gmail.com
101     EXAMPLE@gmail.com

Only ID 101 and its email addresses should be displayed because that's the only ID that has duplicate emails due to case sensitivity.

Thank you in advance!

Also see [delete duplicates](https://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql) and consider changing the email column to a case insensitive collation with a unique key to prevent the addition of duplicates. — danblack, Jul 08 '22 at 00:04

score 3 · Answer 1 · edited Jul 09 '22 at 04:27

3

Use a self-join that performs both case-sensitive and case-insensitive comparisons.

(note - edited for Snowflake)


    SELECT a.id, a.email
    FROM yourTable a
    JOIN yourTable b 
        ON a.id = b.id 
        AND collate(a.email,'en-ci') = collate(b.email,'en-ci') 
        AND a.email != b.email;

edited Jul 09 '22 at 04:27

Jim Demitriou

593
4
8

answered Jul 07 '22 at 20:34

Barmar

741,623
53
500
612

thank you for your help. I am trying to get this done in the Snowflake database. I do not know the differences between utf8 formats. However, if I use both conditions as you provided, I am getting an empty table. For my case, should I use utf8mb4_general_ci or utf8mb4_bin? – EZ72 Jul 08 '22 at 10:47
@EZ72 If you are uysing snowflake, why did you tag MySQL in the first place? Close this question and ask a new one. – The Impaler Jul 08 '22 at 12:52
@TheImpaler it does not let me delete the question as people started to put the effort into it. I changed the tags. Sorry for the inconvenience. – EZ72 Jul 08 '22 at 14:21
@Barmar I just added two rows for example which is going to represent my population better. Thank you! – EZ72 Jul 08 '22 at 14:46
1

I'm not familiar with Snowflake, I don't know how to translate this. But the basic idea should be the same -- use a self-join where the emails match case-insensitively, but not case-sensitively. – Barmar Jul 08 '22 at 15:01
A minor adjustment to the code would make this approach Snowflake compatible - you can user collate(a.email, 'en-ci') = collate(b.email, 'en-ci') for the case insensitive comparison. You don't need a collate for case sensitivity, that's the default (assuming it wasn't overridden in the table definition). – Jim Demitriou Jul 09 '22 at 04:22

score 1 · Answer 2 · answered Jul 07 '22 at 20:34

1

You can do:

select * 
from t
where lower(email) in (
  select lower(email)
  from t
  group by lower(email)
  having min(email) <> max(email)
)

answered Jul 07 '22 at 20:34

The Impaler

45,731
9
39
76

If they're using a case-insensitive collation, `min(email) <> max(email)` won't be true when they differ in case. – Barmar Jul 07 '22 at 20:37
@Barmar Yes, you are right. – The Impaler Jul 08 '22 at 12:52

score 0 · Accepted Answer · answered Jul 08 '22 at 14:50

0

The self-join suggestion was helpful. This should work!

SELECT
      distinct a.id
     ,a.email
FROM table1 AS a
JOIN table1 AS b 
    ON a.id = b.id 
WHERE a.email != b.email
     AND lower(a.email) = lower(b.email)

answered Jul 08 '22 at 14:50

EZ72

61
7

How to find duplicate emails due to case sensitivity

3 Answers3