Python regex statement

Question

I'd like to create a regex statement in Python 2.7.8 that will substitute characters. It will work like this...

ó -> o
ú -> u
é -> e
á -> a
í -> i
ù,ú  -> u

These are the only Unicode characters that I would like to change. I don't want to change other Unicode characters such as ë and ä. So the word thójlà will become thojla. I'm sure there is a way to do this without writing a separate regex for each character, like below.

word = re.sub(ur'ó', ur'o', word)
word = re.sub(ur'ú', ur'u', word)
word = re.sub(ur'é', ur'e', word)
....

I've been trying to figure this out but haven't had any luck. Any help is appreciated!

Are you sure you want regex? This sounds like a job for `replace()` — Gillespie, Dec 08 '14 at 23:16

chapelo · Answer 1 · 2014-12-08T23:55:00.987

4

Try with str.translate and maketrans...

print('thójlà'.translate(str.maketrans('óúéáíùú', 'oueaiuu')))
# thojlà

This way you make only the substitutions you want.

If you had many strings to change, you should assign your maketrans to a variable, like

table = str.maketrans('óúéáíùú', 'oueaiuu')

and then, each string can be translated as

s.translate(table)

edited Dec 08 '14 at 23:55

answered Dec 08 '14 at 23:39

chapelo


1

Nice. "There should be one-- and preferably only one --obvious way to do it." and this is it. – twasbrillig Dec 08 '14 at 23:42
Note, the code above is for Python 3. In Python 2 it's `string`, not `str`: `print 'thójlà'.translate(string.maketrans('óúéáíùú', 'oueaiuu'))` – twasbrillig Dec 08 '14 at 23:52
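A note on the Python 2 suggestion in the comment above: as far as I know, `string.maketrans` in Python 2 builds a byte-level table and requires both arguments to have the same length in bytes, so it would likely reject UTF-8 encoded accented characters. A minimal sketch that should run on both Python 2.7 and Python 3, assuming the text is held as Unicode (the `u''` literals) and using a dict keyed by code point; the name `ACCENT_TABLE` is only illustrative:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Map each accented character's code point to its plain replacement.
# Only the characters listed here are touched; anything else passes through.
ACCENT_TABLE = {ord(src): dst for src, dst in zip(u'óúéáíù', u'oueaiu')}

word = u'thójlà'
result = word.translate(ACCENT_TABLE)
print(result)  # thojlà (à is not in the table, so it is left alone)
```

Adding `à` to the first string and `a` to the second would give `thojla`, which is what the question expects.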

Gillespie · Answer 2 · 2014-12-09T01:18:01.940

3

With the string replace() method you can do something like:

>>> x = "thójlà"
>>> x
'thójlà'
>>> x = x.replace('ó', 'o')
>>> x
'thojlà'
>>> x = x.replace('à', 'a')
>>> x
'thojla'

A generalized way:

# -*- coding: utf-8 -*-

replace_dict = {
    'á':'a',
    'à':'a',
    'é':'e',
    'í':'i',
    'ó':'o',
    'ù':'u',
    'ú':'u'
}

str1 = "thójlà"

for key in replace_dict:
    str1 = str1.replace(key, replace_dict[key])

print(str1) #prints 'thojla'

A third way, if your list of character mappings is getting too large:

# -*- coding: utf-8 -*-

replace_dict = {
    'a':['á','à'],
    'e':['é'],
    'i':['í'],
    'o':['ó'],
    'u':['ù','ú']
}

str1 = "thójlà"

for key, values in replace_dict.items():
    for character in values:
        str1 = str1.replace(character, key)

print(str1)

edited Dec 09 '14 at 01:18

answered Dec 08 '14 at 23:18

Gillespie


is there a way that I can do this without having to create a statement for each character? I could have done that with re.sub but that's what I want to avoid in case the list of characters to be changed becomes large. Thanks for the help! – user2743 Dec 08 '14 at 23:22
I also added a third method. – Gillespie Dec 08 '14 at 23:34
@RPGillespie, to make it a bit more efficient, you can do: `for key, values in replace_dict:`, `for character in values:` – twasbrillig Dec 08 '14 at 23:40
the dictionary replace technique could be pretty slow if there are a lot of replacement characters and/or the replacement text is long. – Matt Coubrough Dec 08 '14 at 23:42
@twasbrillig I never realized you could do that! – Gillespie Dec 09 '14 at 01:02
Oops, it was actually `for key, values in replace_dict.items():` but it looks like you figured that out already. – twasbrillig Dec 09 '14 at 19:41
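Regarding the speed concern raised in the comments, and the original request for a single regex: one possible sketch is to compile one character class from the dictionary keys and look up each match in a callback, so the string is scanned once instead of once per character. The names `REPLACEMENTS`, `PATTERN` and `replace_accents` are hypothetical, and the mapping simply mirrors the dictionary in the answer above:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import re

# Illustrative mapping; extend it as the list of characters grows.
REPLACEMENTS = {
    u'á': u'a', u'à': u'a',
    u'é': u'e',
    u'í': u'i',
    u'ó': u'o',
    u'ù': u'u', u'ú': u'u',
}

# A single character class that matches any key of the mapping.
PATTERN = re.compile(u'[' + re.escape(u''.join(REPLACEMENTS)) + u']')

def replace_accents(text):
    """Replace every mapped character in one pass over the text."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0)], text)

print(replace_accents(u'thójlà'))  # thojla
```

Characters that are not keys of the mapping, such as ë and ä, never match the pattern and are left unchanged.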

score 1 · Answer 3 · edited Dec 08 '14 at 23:53

1

If you can use external packages, the easiest way, I think, would be to use unidecode. For example:

from unidecode import unidecode

print(unidecode('thójlà'))
# prints: thojla

edited Dec 08 '14 at 23:53

twasbrillig


answered Dec 08 '14 at 23:17

Marcin


what if I have other unicode characters that I don't want to substitute will those characters be affected? Thanks for the help! – user2743 Dec 08 '14 at 23:18
Yes, all non-ascii characters will be transliterated. – Gillespie Dec 08 '14 at 23:20
Maybe there are some options to specify which characters are changed and which are not. I don't know about this. I assumed you want to change everything. – Marcin Dec 08 '14 at 23:21
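If the goal is to be selective without an external package, one possible approach (a sketch, not something unidecode itself offers as far as I know) is the standard library's `unicodedata`: decompose the text, drop only the combining acute and grave accents, and recompose. The names `MARKS_TO_STRIP` and `strip_selected_accents` are hypothetical. Note the assumption here: this strips acute and grave accents from every letter, not just from the vowels listed in the question.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# Combining grave accent (U+0300) and combining acute accent (U+0301).
MARKS_TO_STRIP = {u'\u0300', u'\u0301'}

def strip_selected_accents(text):
    """Remove acute and grave accents; keep other marks such as the diaeresis."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFD', text)
    kept = u''.join(ch for ch in decomposed if ch not in MARKS_TO_STRIP)
    # NFC recomposes whatever marks remain (for example e + diaeresis -> ë).
    return unicodedata.normalize('NFC', kept)

print(strip_selected_accents(u'thójlà ë ä'))  # thojla ë ä
```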
