Python: How to test if a string contains one of the strings in a list, accent-insensitively?

Question

I need to test whether a string contains one of the strings in a list, ignoring accents.

I tried using for + in + if + unidecode but without success:

from unidecode import unidecode

def temServentiaExclusiva(nome_orgao):
    # FIXME: fetch the ids dynamically
    regras = [
        {'especializada_id': 70, 'termos': [u'orfaos e sucessoes', u'familia']}
    ]

    for r in regras:
        # if(unidecode(nome_orgao) in s for s in r['termos']):
        if [t for t in r['termos'] if t in unidecode(nome_orgao)]:
            return r['especializada_id']


print(temServentiaExclusiva('orfãos'))
print(temServentiaExclusiva('Cartório da 6ª Vara de Orfãos e Sucessões'))

The result was None :(

So, how can I achieve that?

You may want to add code to the start of `temServentiaExclusiva()` that looks through `nome_orgao`, finds any accented characters, and replaces them with their unaccented versions before checking. – Spencer Lutz, May 11 '20 at 23:37
@SpencerLutz this is a proof of concept for something bigger. – celsowm, May 11 '20 at 23:40

Matvei · Answer 1 · 2020-05-12T00:01:05.017


You can do this with a nested for loop, rather than a list comprehension:

from unidecode import unidecode

def temServentiaExclusiva(nome_orgao):
    regras = [
        {'especializada_id': 70, 'termos': [u'orfaos e sucessoes', u'familia']}
    ]

    # Normalize once: transliterate to ASCII, then lowercase.
    uni_nome_orgao = unidecode(nome_orgao).lower()

    for r in regras:
        for t in r['termos']:
            # Match in either direction: the name inside a term,
            # or a term inside the name.
            if uni_nome_orgao in t or t in uni_nome_orgao:
                return r['especializada_id']

print(temServentiaExclusiva('orfãos'))

The key is to get `nome_orgao` into a standard format once and then check it against every entry in `termos`. As you've already done, `unidecode` transliterates the text to ASCII, which strips the accents. Adding `.lower()` makes everything lowercase as well, which matters because the terms in `regras` are lowercase. Then iterate over each `r` in `regras` and each `t` in `r['termos']`, and check whether `t` is in `uni_nome_orgao` or `uni_nome_orgao` is in `t`.
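The same idea can also be written with `any()` instead of the inner loop. This is only a sketch: the `regras` parameter, the explicit `return None`, and the standard-library fallback used when `unidecode` is not installed are assumptions added for illustration, not part of the code above.

```python
import unicodedata

try:
    from unidecode import unidecode
except ImportError:
    # Assumption: a rough stand-in so the sketch runs without the
    # third-party package. It decomposes characters (NFKD) and drops
    # everything that is not ASCII, which is cruder than unidecode.
    def unidecode(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

REGRAS = [
    {'especializada_id': 70, 'termos': ['orfaos e sucessoes', 'familia']},
]

def temServentiaExclusiva(nome_orgao, regras=REGRAS):
    # Normalize the input once: no accents, all lowercase.
    nome = unidecode(nome_orgao).lower()
    for r in regras:
        # True if any term contains the name or the name contains the term.
        if any(t in nome or nome in t for t in r['termos']):
            return r['especializada_id']
    return None  # nothing matched

print(temServentiaExclusiva('orfãos'))                                     # 70
print(temServentiaExclusiva('Cartório da 6ª Vara de Orfãos e Sucessões'))  # 70
print(temServentiaExclusiva('Vara Criminal'))                              # None
```

Note that checking `nome in t` makes short inputs match easily (an empty string matches every term), so you may want to drop that direction if only "name contains term" is needed.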

Hope that helps!

edited May 12 '20 at 00:01

answered May 11 '20 at 23:43


do you know why the second one does not work? https://repl.it/repls/PurpleSizzlingAddition – celsowm May 11 '20 at 23:48
Do you want to know if nome_orgao is in any of the termos, or if any of the termos is in nome_orgao? – Matvei May 11 '20 at 23:56
could it be both? – celsowm May 12 '20 at 00:00

Note on the accent-stripping part: depending on the task-specific definition of "accent insensitive", `unidecode` might not be the right tool, in particular if you want to strip accents from scripts other than Latin, or if it's a problem that `unidecode` turns "€" into "EUR" and "1°" into "1deg". That happens because `unidecode` performs strict ASCII-fication. [Here](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string/)'s a range of alternative approaches. – lenz May 12 '20 at 07:12
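To illustrate one kind of alternative to strict ASCII-fication, here is a minimal sketch that uses only the standard library's `unicodedata` module: it decomposes each character (NFD) and drops the combining marks, leaving everything else alone. The helper name `strip_accents` is hypothetical and not taken from this thread.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks (accents) and keep all other characters."""
    # NFD splits 'ã' into 'a' plus a combining tilde, and so on.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero only for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Orfãos e Sucessões"))  # Orfaos e Sucessoes
print(strip_accents("\u20ac 1\u00b0"))      # euro and degree signs are left as they are
```

Unlike `unidecode`, this does not guarantee ASCII output: symbols such as "€" and "°", and letters without a decomposition, pass through unchanged. It does not change case either, so `.lower()` is still needed for a case-insensitive comparison.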
