3

I'd like to create a regex statement in Python 2.7.8 that will substitute characters. It will work like this...

ó -> o
ú -> u
é -> e
á -> a
í -> i
ù,ú  -> u

These are the only Unicode characters that I would like to change. I don't want to change other Unicode characters, such as ë or ä. So the word thójlà will become thojla. I'm sure there is a way to do this without writing a separate regex for each character, like below.

word = re.sub(ur'ó', ur'o', word)
word = re.sub(ur'ú', ur'u', word)
word = re.sub(ur'é', ur'e', word)
....

I've been trying to figure this out but haven't had any luck. Any help is appreciated!

user2743

3 Answers

4

Try with str.translate and maketrans...

print('thójlà'.translate(str.maketrans('óúéáíùú', 'oueaiuu')))
# thojlà

This way you ensure that only the substitutions you want are made.

If you have many strings to change, assign the translation table to a variable, like

table = str.maketrans('óúéáíùú', 'oueaiuu')

and then, each string can be translated as

s.translate(table)
chapelo
  • Nice. "There should be one-- and preferably only one --obvious way to do it." and this is it. – twasbrillig Dec 08 '14 at 23:42
  • Note, the code above is for Python 3. In Python 2 it's `string`, not `str`: `print 'thójlà'.translate(string.maketrans('óúéáíùú', 'oueaiuu'))` – twasbrillig Dec 08 '14 at 23:52
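
A note for Python 2.7, which the question targets: as far as I know, `string.maketrans` there works on byte strings and requires both arguments to have the same length, so it may fail on UTF-8 encoded accented characters. A minimal sketch of an alternative, assuming the input is a unicode string, is a dict that maps code points to replacements. The names `SOURCE`, `TARGET`, `table` and `translate_accents` are only illustrative.

```python
# -*- coding: utf-8 -*-

# Characters to replace and their replacements, position by position
# (taken from the list in the question).
SOURCE = u'óúéáíù'
TARGET = u'oueaiu'

# unicode.translate (Python 2) and str.translate (Python 3) both accept
# a dict that maps code points to replacement strings.
table = dict((ord(src), dst) for src, dst in zip(SOURCE, TARGET))


def translate_accents(word):
    """Return word with only the characters listed in SOURCE replaced."""
    return word.translate(table)


print(translate_accents(u'thójlà'))
# thojlà  (à is not in SOURCE, so it is left alone)
```

The same table can be reused for any number of strings, and characters such as ë or ä pass through unchanged because they are not keys in the dict.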
3

With the string replace() method you can do something like:

>>> x = "thójlà"
>>> x
'thójlà'
>>> x = x.replace('ó', 'o')
>>> x
'thojlà'
>>> x = x.replace('à', 'a')
>>> x
'thojla'

A generalized way:

# -*- coding: utf-8 -*-

replace_dict = {
    'á':'a',
    'à':'a',
    'é':'e',
    'í':'i',
    'ó':'o',
    'ù':'u',
    'ú':'u'
}

str1 = "thójlà"

for key in replace_dict:
    str1 = str1.replace(key, replace_dict[key])

print(str1)  # prints: thojla

A third way, if your list of character mappings is getting too large:

# -*- coding: utf-8 -*-

replace_dict = {
    'a':['á','à'],
    'e':['é'],
    'i':['í'],
    'o':['ó'],
    'u':['ù','ú']
}

str1 = "thójlà"

for key, values in replace_dict.items():
    for character in values:
        str1 = str1.replace(character, key)

print(str1)
Gillespie
  • Is there a way that I can do this without having to create a statement for each character? I could have done that with re.sub, but that's what I want to avoid in case the list of characters to be changed becomes large. Thanks for the help! – user2743 Dec 08 '14 at 23:22
  • I also added a third method. – Gillespie Dec 08 '14 at 23:34
  • @RPGillespie, to make it a bit more efficient, you can do: `for key, values in replace_dict:`, `for character in values:` – twasbrillig Dec 08 '14 at 23:40
  • the dictionary replace technique could be pretty slow if there are a lot of replacement characters and/or the replacement text is long. – Matt Coubrough Dec 08 '14 at 23:42
  • @twasbrillig I never realized you could do that! – Gillespie Dec 09 '14 at 01:02
  • Oops, it was actually `for key, values in replace_dict.items():` but it looks like you figured that out already. – twasbrillig Dec 09 '14 at 19:41
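
Since the question asked for a regex and the comments raise the cost of calling replace() once per character, here is a rough sketch of a single-pass version: one compiled pattern plus a dict lookup in the replacement callback. It reuses the mapping from this answer; `REPLACEMENTS`, `PATTERN` and `replace_accents` are names made up for the example, and on Python 2 it assumes unicode strings throughout.

```python
# -*- coding: utf-8 -*-
import re

# Mapping taken from the answer above; extend it as needed.
REPLACEMENTS = {
    u'á': u'a',
    u'à': u'a',
    u'é': u'e',
    u'í': u'i',
    u'ó': u'o',
    u'ù': u'u',
    u'ú': u'u',
}

# One alternation that matches any key of the mapping.
PATTERN = re.compile(u'|'.join(re.escape(key) for key in REPLACEMENTS))


def replace_accents(word):
    """Replace every mapped character in a single pass over word."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], word)


print(replace_accents(u'thójlà ëä'))
# thojla ëä
```

Adding a character then only means adding a dict entry, and the string is scanned once no matter how many entries there are.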
1

If you can use external packages, the easiest way, I think, would be using unidecode. For example:

from unidecode import unidecode

print(unidecode('thójlà'))
# prints: thojla
twasbrillig
Marcin
  • What if I have other Unicode characters that I don't want to substitute? Will those characters be affected? Thanks for the help! – user2743 Dec 08 '14 at 23:18
  • Yes, all non-ascii characters will be transliterated. – Gillespie Dec 08 '14 at 23:20
  • Maybe there are some options to specify which characters are changed and which are not. I don't know about this. I assumed you wanted to change everything. – Marcin Dec 08 '14 at 23:21
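
On leaving some characters untouched: this is not a unidecode feature, but a different technique that might fit, using only the standard library's `unicodedata` module. The sketch decomposes the text, drops only the acute and grave combining marks, and recomposes it, so ë and ä survive. The names `MARKS_TO_STRIP` and `strip_acute_grave` are hypothetical. Note that it affects every letter carrying those two accents (è and ò as well, for instance), which is broader than the exact list in the question.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Combining marks to drop: U+0300 (grave accent) and U+0301 (acute accent).
MARKS_TO_STRIP = set([u'\u0300', u'\u0301'])


def strip_acute_grave(text):
    """Remove acute and grave accents; keep all other diacritics."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFD', text)
    kept = u''.join(ch for ch in decomposed if ch not in MARKS_TO_STRIP)
    # NFC puts the remaining marks (e.g. the diaeresis) back together.
    return unicodedata.normalize('NFC', kept)


print(strip_acute_grave(u'thójlà ëä'))
# thojla ëä
```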