0

I have a database table which represent people and the records have people's names in them. Some of the names have accented characters in them. Some do not. Some are non-accented duplicates of the accented version.

I need to generate a report of all of the potential duplicates by finding names that are the same (first, middle, last) except for the accents so that someone else can go through this list and verify which are true duplicates, and which are actually different people (I'm assuming they have some other way of knowing).

For example: Jose DISTINCT-LAST-NAME and José DISTINCT-LAST-NAME should be picked up as potential duplicates because they have the same characters, but one has an accented character.

How can this type of query by written in MySQL?


This question: How to remove accents in MySQL? is not the same. It is asking about de-accenting strings in-place and the poster already has a second column of data that has been de-accented. Also, the accepted answer to that question is to set the character set and collation. I have already set the character set and collation.

I am trying to generate a report that finds strings in different records that are the same except for their accents.

Community
  • 1
  • 1
FrustratedWithFormsDesigner
  • 26,726
  • 31
  • 139
  • 202
  • Possible duplicate of [How to remove accents in MySQL?](http://stackoverflow.com/questions/4813620/how-to-remove-accents-in-mysql) – Tony Chiboucas Jun 06 '16 at 20:27
  • @TonyChiboucas: The person who asked the question you linked to *already* has the second column with the unaccented string. I need to write a query to find those and compare them to potential duplicates that differ only by accents. – FrustratedWithFormsDesigner Jun 06 '16 at 20:38

3 Answers3

1

I found your question very interesting.

According to this article Accents in text searches, using "like" condition with some character collation adjustments will solve your problem. I have not tested this solution, so if it helps you, please come back and tell us.

Here is a similar question: Accent insensitive search query in MySQL, according to that, you can use something like:

where 'José' like 'Jose' collate utf8_general_ci
Community
  • 1
  • 1
  • Rather than simply linking to an external article, answers should be a complete stand-alone answer, with code samples if applicable, that includes the link as crediting a source. http://stackoverflow.com/help/how-to-answer – Tony Chiboucas Jun 06 '16 at 20:03
  • I get COLLATION 'utf8_general_ci' is not valid for CHARACTER SET 'utf8mb4' when I try this – Steven Matthews Feb 01 '22 at 18:53
  • collate utf8mb4_roman_ci (I picked it because it contains the word "Roman", though I expect any exclusively Latin script based collation would work) works for utf8mb4 – Steven Matthews Feb 01 '22 at 19:01
0

The Bane of Character Encodings

There are a wide variety of character-sets and encodings that may be used in MySQL, and when dealing with encoding it is important to learn what you can about them. In particular, take a close look at the differences between:

  • utf8_unicode_ci
  • utf8_general_ci
  • utf8_unicode_520_ci
  • utf8mb4_general_ci

Some character sets are built to include as many printable characters as possible, to support a wider range of uses, while others are built with the intent of portability and compatibility between systems. In particular, utf8_unicode_ci maps most accented characters to non-accented equivalents. Alternatively, you could use uft8_ascii_ci which is even more restrictive.

Take a look at the utf8_unicode_ci collation chart, and What's the difference between utf8_general_ci and utf8_unicode_ci .

The best answer is from a similar question, "How to remove accents in MySQL?"

If you set an appropriate collation for the column then the value within the field will compare equal to its unaccented equivalent naturally.

mysql> SET NAMES 'utf8' COLLATE 'utf8_unicode_ci';
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT 'é' = 'e';
+------------+
| 'é' = 'e' |
+------------+
|          1 |
+------------+
1 row in set (0.05 sec)

How to apply this to your situation?

SELECT id, last-name
FROM people
WHERE last-name COLLATE utf8_unicode_ci IN
(
    SELECT last-name
    FROM people
    GROUP BY last-name COLLATE utf8_unicode_ci
    HAVING COUNT(last-name)>1
)
Community
  • 1
  • 1
Tony Chiboucas
  • 5,505
  • 1
  • 29
  • 37
  • ... or something very similar. but this should be sufficient enough to get you very close. Let me know how you solve it, I'm curious. – Tony Chiboucas Jun 06 '16 at 20:38
  • I think you meant `...IN ( )` as the way it's written I get an error because the subquery returns too many rows, is that what you meant? – FrustratedWithFormsDesigner Jun 06 '16 at 20:42
  • The database, table, and columns involve are already using UTF8 character sets and collation so I'm not sure this will work. – FrustratedWithFormsDesigner Jun 06 '16 at 20:58
  • Not all UTF8 character sets are the same. In fact most of them are quite unique. The entire purpose of that specific set `utf8_unicode_ci` vs `utf8_general` is that this character encoding maps accented characters to non-accented characters, as per your request. http://collation-charts.org/mysql60/mysql604.utf8_unicode_ci.european.html – Tony Chiboucas Jun 07 '16 at 19:39
0

Well, I found something that seems to work (the real query involves a few more other fields, but the same basic idea):

select distinct p1.person_id, p1.first_name, p1.last_name, p2.last_name
from people as p1, people as p2
where binary p1.last_name <> binary p2.last_name
        and p1.last_name = p2.last_name
        and p1.first_name = p2.first_name
order by p1.last_name, p1.first_name, p2.last_name, p2.first_name;

The results look like this:

12345    Bob    Jose    José
56789    Bob    José    Jose
...

This makes sense as there are 2 records for Bob José and I know that in this case, it is the same person but one record is missing the accent.

The trick is to do a binary and non-binary compare on the "last_name" field as well as matching on all other fields. This way we can find everything that is "equal" and also not binary-equal. This works because with the current character-set/collation (utf8/utf8_general_ci), Jose and José are equal but are not binary-equal. you can try it out like this:

select 'Jose' = 'José', 'Jose' like 'José', binary 'Jose' = binary 'José';
FrustratedWithFormsDesigner
  • 26,726
  • 31
  • 139
  • 202