Compare 2 strings without considering accents in Python

Question

I would like to compare two strings and get True if they are identical apart from their accents.

Example: I would like the following code to print 'Bonjour':

if 'séquoia' in 'Mon sequoia est vert':
    print 'Bonjour'

Convert to fully decomposed normal form, remove accents, compare. — tripleee, Dec 22 '13 at 13:17
Linked: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string — Basj, May 22 '20 at 17:20
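
The three steps in the first comment (decompose, remove accents, compare) can be sketched with the standard library alone. This is a minimal sketch, assuming Python 3, where every string literal is already Unicode; the helper name `strip_accents` is hypothetical and does not come from the thread.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: split each accented character into its base
    # character plus combining marks (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

if strip_accents('séquoia') in 'Mon sequoia est vert':
    print('Bonjour')
```

Case is left untouched here, so 'séQuoIa' becomes 'seQuoIa' rather than 'sequoia'.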

score 12 · Accepted Answer · edited Dec 22 '13 at 13:34


You should use the `unidecode` function from the Unidecode package:

from unidecode import unidecode

if unidecode(u'séquoia') in 'Mon sequoia est vert':
    print 'Bonjour'

edited Dec 22 '13 at 13:34

vikingosegundo


answered Dec 22 '13 at 13:24

Suor


score 6 · Answer 2 · edited May 23 '17 at 12:07


You should take a look at Unidecode. With the standard-library `unicodedata` module and the function below, you can get a string without accents and then make your comparison:

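import string
import unicodedata

def remove_accents(data):
    # data must be a unicode string; keeps only ASCII letters, lowercased
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()


if remove_accents(u'séquoia') in 'Mon sequoia est vert':
    # Do something
    pass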

Reference from stackoverflow

edited May 23 '17 at 12:07

Community


answered Dec 22 '13 at 13:18

Maxime Lorant


This would not work if the word was "séQuoIa" since the `remove_accents` method makes all of the characters lowercase. – Benjamin Lowry Jun 27 '17 at 14:31
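
One way to handle the mixed-case situation raised in this comment is to fold both the needle and the haystack before comparing, instead of lowercasing only one side. This is a minimal sketch, assuming Python 3 and assuming a case-insensitive match is what you want; the names `fold` and `contains_ignoring_accents` are hypothetical.

```python
import unicodedata

def fold(text):
    # Hypothetical helper: decompose, drop combining marks, then casefold.
    decomposed = unicodedata.normalize('NFKD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

def contains_ignoring_accents(needle, haystack):
    # Apply the same folding to both sides so accents and case cannot differ.
    return fold(needle) in fold(haystack)

if contains_ignoring_accents('séQuoIa', 'Mon Sequoia est vert'):
    print('Bonjour')
```

Unlike `remove_accents`, this keeps spaces and digits, so it also works when the needle contains several words.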

Javier Buzzi · Answer 3 · 2020-05-08T11:01:23.260

(sorry, late to the party!!)

How about instead, doing this:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'î ï í ī į ì').encode('ASCII', 'ignore')
'i i i i i i'

No need to loop over anything. @Maxime Lorant's answer is roughly five times slower in the timing below:

>>> import timeit
>>> code = """
import string, unicodedata
def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters).lower()
"""
>>> timeit.timeit("remove_accents(u'séquoia')", setup=code)
3.6028339862823486
>>> timeit.timeit("unicodedata.normalize('NFKD', u'séquoia').encode('ASCII', 'ignore')", setup='import unicodedata')
0.7447490692138672

Hint: lower is better

Also, I'm sure the unidecode package that @Suor suggested has other advantages, but in the timing below it is roughly ten times slower than the native option, which requires no third-party libraries.

>>> timeit.timeit("unicodedata.normalize('NFKD', u'séquoia').encode('ASCII', 'ignore')", setup="import unicodedata")
0.7662729263305664
>>> timeit.timeit("unidecode.unidecode(u'séquoia')", setup="import unidecode")
7.489392042160034

Hint: lower is better

Putting it all together:

import unicodedata

clean_text = unicodedata.normalize('NFKD', u'séquoia').encode('ASCII', 'ignore')
if clean_text in 'Mon sequoia est vert':
    ...
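
The session above is Python 2, where `.encode()` returns a `str`. On Python 3 the same call returns `bytes`, and testing `bytes in str` raises a TypeError, so the result has to be decoded back to text first. A minimal sketch, assuming Python 3; the helper name `to_ascii` is hypothetical.

```python
import unicodedata

def to_ascii(text):
    # Hypothetical helper: decompose, discard every non-ASCII code point
    # (including the combining accents), then decode back to str.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

clean_text = to_ascii('séquoia')
if clean_text in 'Mon sequoia est vert':
    print('Bonjour')
```

Note that characters with no decomposition, such as 'ø', are dropped entirely rather than mapped to a base letter.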
