Python - handling multiple str.replace calls better?

Question

I often find myself wandering through large sets of text, extracting terms or otherwise cleaning things up so that I can re-use a string as a filename or suchlike.

In a recent task, I grabbed a few hundred pdf files from a website, and wanted to use the article title as the filename to assist my colleagues in checking in the files.

I can get the title from the HTML, but titles often contain characters that are illegal in Windows filenames (e.g. :, ", >), which means I have to do some substitutions before I can use the title.

As a result, I started using this line of code:

fname = art_number+" "+content_title.replace(":", " -").replace("&#8211;", "-").replace(u'\xae', "-").replace("\"", "").replace("?","").replace("<i>", "").replace("</i>", "").replace("/", " ").replace("<sup>-< sup>", "-")

As you can see, that is heaps of str.replace calls, which is not very readable or manageable.

Each replacement is generally considered manually; I wouldn't want to throw them at a code book, as there are usually some nuances per set of content that I want to find and check.

What would be your approach to this?

At least consider building a lookup table instead of one big chain... and possibly utilising `re.sub` – Jon Clements, Dec 04 '13 at 18:17
@JonClements I have built lookups for larger projects; this one started with only two replacements and then grew. What's the advantage of `re.sub`? – Jay Gattuso, Dec 04 '13 at 18:24

score 2 · Answer 1 · answered Dec 04 '13 at 18:21

For single-character replacements, I would use unicode.translate (str.translate in Python 3).
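
As a minimal sketch of the single-character case, assuming Python 3 (where the method lives on `str` and `str.maketrans` builds the table), something like the following could work. The mapping and the helper name `clean_single_chars` are illustrative choices based on the characters in the question, not part of the original answer.

```python
# Hypothetical mapping: keys are single characters, values are the
# replacement strings, or None to delete the character outright.
table = str.maketrans({
    ":": " -",
    "/": " ",
    '"': None,
    "?": None,
    "\xae": "-",
})


def clean_single_chars(title):
    """Apply every single-character replacement in one pass."""
    return title.translate(table)


print(clean_single_chars('What/why: a "test"?'))  # What why - a test
```

Because the whole table is applied in a single pass, the output of one replacement is never re-scanned by another, which is a difference from chaining `str.replace` calls.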

For longer strings, I would build a dict of possible replacements indexed by leading two characters, then step through the string testing only the possible replacements at each position.
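
One way to read the approach for longer strings is sketched below. The function names `build_index` and `replace_all` are hypothetical, and the sketch assumes every key is at least two characters long and that the longest candidate should win when several match at the same position.

```python
from collections import defaultdict


def build_index(replacements):
    """Group (old, new) pairs by the first two characters of `old`."""
    index = defaultdict(list)
    for old, new in replacements.items():
        if len(old) < 2:
            raise ValueError("keys must be at least two characters long")
        index[old[:2]].append((old, new))
    # Try longer candidates first so the longest match wins.
    for candidates in index.values():
        candidates.sort(key=lambda pair: len(pair[0]), reverse=True)
    return index


def replace_all(text, index):
    """Scan once, testing only candidates that share the 2-char prefix."""
    out = []
    i = 0
    while i < len(text):
        for old, new in index.get(text[i:i + 2], ()):
            if text.startswith(old, i):
                out.append(new)
                i += len(old)
                break
        else:
            # No candidate matched here; copy one character through.
            out.append(text[i])
            i += 1
    return "".join(out)


index = build_index({"&#8211;": "-", "<i>": "", "</i>": ""})
print(replace_all("A <i>B</i> &#8211; C", index))  # A B - C
```

Single-character keys are deliberately rejected here, on the assumption that they are handled separately with `translate`.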

answered Dec 04 '13 at 18:21

Hugh Bothwell

Joran Beasley · Answer 2 · 2013-12-04T22:19:58.447

1

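import re

# Map each substring to its replacement; extend with further pairs as needed.
replacements = {
    "&#8211;": "-",
    ":": "-",
    # ...
}

def replacer(match):
    # re.sub passes a match object; group(0) is the text that matched.
    return replacements[match.group(0)]

# Longest keys first, escaped so regex metacharacters are matched literally.
pattern = "|".join(re.escape(k) for k in sorted(replacements, key=len, reverse=True))
re.sub(pattern, replacer, my_text)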

I think this would work.

edited Dec 04 '13 at 22:19

answered Dec 04 '13 at 18:29

Joran Beasley

1

The premise is fine, a few points though: use `re.escape` on the elements, as some of them could affect the operation of the regex if left unescaped; you're missing the `key=` and `reverse=` for the `sorted`; and the `replacer` function will receive a match object, not a string... – Jon Clements Dec 04 '13 at 18:43
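
To illustrate the `re.escape` point, here is a small sketch using sample keys from the question's chain. The helper `build_pattern` and its `escape` flag are made up for this demonstration only.

```python
import re

keys = ["?", "<i>", "</i>"]  # sample keys taken from the question


def build_pattern(keys, escape=True):
    """Join keys into an alternation, longest first."""
    ordered = sorted(keys, key=len, reverse=True)
    if escape:
        ordered = [re.escape(k) for k in ordered]
    return "|".join(ordered)


# Unescaped, "?" is read as a quantifier with nothing to repeat,
# so compiling the pattern raises re.error.
try:
    re.compile(build_pattern(keys, escape=False))
    unescaped_ok = True
except re.error:
    unescaped_ok = False

escaped = re.compile(build_pattern(keys))
print(unescaped_ok)                               # False
print(escaped.sub("", "Is <i>this</i> safe?"))    # Is this safe
```

Other metacharacters, such as `.` or `|`, would not raise an error when left unescaped but would silently match more than intended, which is arguably the worse failure.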

score 1 · Answer 3 · edited May 23 '17 at 12:13

I think this answer to an earlier question, "Python replace multiple strings", would work well for you. It wasn't the accepted answer, but it works well and is a nice small function.

edited May 23 '17 at 12:13

Community

answered Dec 04 '13 at 18:36

Chris Hagmann

martineau · Answer 4 · 2013-12-05T03:30:04.480

1

You could use reduce() and a sequence of the replacement pairs:

from functools import reduce

replacements = (":", " -"), ("a", "1"), ("b", "2"), ("c", "3")
content_title = "Testing: abc"
print(reduce(lambda s, args: s.replace(*args), replacements, content_title))

Output:

Testing - 123
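
For readers less used to `reduce()`, the call is roughly equivalent to the plain loop sketched below. The function name `replace_sequentially` is an illustrative choice. The sketch also shows that the pairs are applied one after another, so their order can change the result.

```python
def replace_sequentially(text, pairs):
    """Apply each (old, new) pair in order, like the reduce() call."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


pairs = [(":", " -"), ("a", "1"), ("b", "2"), ("c", "3")]
print(replace_sequentially("Testing: abc", pairs))  # Testing - 123

# Order matters: the output of one step is the input to the next.
print(replace_sequentially("ab", [("a", "b"), ("b", "c")]))  # cc
print(replace_sequentially("ab", [("b", "c"), ("a", "b")]))  # bc
```

If later replacements should never touch text produced by earlier ones, a single-pass approach such as a combined regular expression may be a better fit.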

edited Dec 05 '13 at 03:30

answered Dec 04 '13 at 18:43

martineau
