
I often find myself wandering through large sets of text, extracting terms or otherwise cleaning things up so that I can reuse a string as a filename or the like.

In a recent task, I grabbed a few hundred PDF files from a website and wanted to use each article's title as the filename, to help my colleagues check the files in.

I can get the title from the HTML, but titles often contain characters that are illegal in Windows filenames (e.g. :, ", >), so I have to do some substitutions before I can use the title.

As a result, I started using this line of code:

fname = art_number+" "+content_title.replace(":", " -").replace("&#8211;", "-").replace(u'\xae', "-").replace("\"", "").replace("?","").replace("<i>", "").replace("</i>", "").replace("/", " ").replace("<sup>-< sup>", "-")

As you can see, that's heaps of str.replace calls, which is neither readable nor manageable.

Each replacement is generally considered manually. I wouldn't want to throw them all at a code book, as there are usually some nuances per set of content that I want to find and check.

What would be your approach to this?

Jay Gattuso
  • At least consider building a lookup table instead of one big chain... and possibly utilising `re.sub` – Jon Clements Dec 04 '13 at 18:17
  • @JonClements I have built lookups for larger projects - this one started with only two replacements, then grew. What's the advantage of `re.sub`? – Jay Gattuso Dec 04 '13 at 18:24

4 Answers


For single-char replacements, I would use unicode.translate.
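
A minimal sketch of the single-character case, assuming Python 3, where `unicode.translate` is simply `str.translate` and the table can be built with `str.maketrans`. The characters and replacements below are illustrative picks from the question's chain, not a complete list of Windows-illegal characters, and `clean_chars` is a hypothetical helper name.

```python
# Map single characters to replacements; None deletes the character.
# The mapping is an illustrative assumption based on the question's chain.
table = str.maketrans({
    ":": " -",
    "/": " ",
    '"': None,
    "?": None,
    "\xae": "-",
})

def clean_chars(title):
    """Apply all single-character substitutions in one pass."""
    return title.translate(table)

print(clean_chars('What is "A/B": a test?'))  # What is A B - a test
```

Because the whole table is applied in a single pass, the output of one substitution is never re-examined by another, unlike a chain of `str.replace` calls.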

For longer strings, I would build a dict of possible replacements indexed by leading two characters, then step through the string testing only the possible replacements at each position.
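
One way the prefix-indexed lookup could look, as a rough sketch rather than a tuned implementation. It assumes Python 3 and that every pattern is at least two characters long; `REPLACEMENTS`, `by_prefix` and `replace_long` are hypothetical names, and the patterns are borrowed from the question's chain.

```python
from collections import defaultdict

# Hypothetical multi-character replacements taken from the question's chain.
REPLACEMENTS = {
    "&#8211;": "-",
    "<i>": "",
    "</i>": "",
}

# Index the patterns by their first two characters so that, at each position,
# only patterns that could possibly match there are tested.
by_prefix = defaultdict(list)
for old, new in REPLACEMENTS.items():
    by_prefix[old[:2]].append((old, new))
for candidates in by_prefix.values():
    # Try longer patterns first so they win over shorter overlapping ones.
    candidates.sort(key=lambda pair: len(pair[0]), reverse=True)

def replace_long(text):
    out = []
    i = 0
    while i < len(text):
        for old, new in by_prefix.get(text[i:i + 2], ()):
            if text.startswith(old, i):
                out.append(new)
                i += len(old)
                break
        else:
            # No pattern starts here; copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(replace_long("<i>Title</i> &#8211; part 2"))  # Title - part 2
```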

Hugh Bothwell
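import re

# Map each substring to its replacement; add further pairs as needed.
replacements = {
    "&#8211;": "-",
    ":": "-",
    # ...
}

def replacer(match):
    # re.sub passes a match object, so look up the matched text.
    return replacements[match.group(0)]

# Escape each key so regex metacharacters such as "?" are matched literally,
# and try longer keys first so they win over shorter overlapping ones.
pattern = "|".join(re.escape(k) for k in sorted(replacements, key=len, reverse=True))
result = re.sub(pattern, replacer, my_text)  # my_text is the string to clean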

I think that would work.

Joran Beasley
  • The premise is fine - a few points though: use `re.escape` on the elements, as some of them could affect the operation of the regex if left unescaped; you're missing the `key=` and `reverse=` for the `sorted`; and the `replacer` function will receive a match object, not a string... – Jon Clements Dec 04 '13 at 18:43

This answer to an earlier question, "Python replace multiple strings", would work well for you, I think. It wasn't the accepted answer, but it works well and fits in a nice small function.

Chris Hagmann

You could use reduce() and a sequence of the replacement pairs:

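from functools import reduce

# Each pair is (old, new); the pairs are applied in order, left to right.
replacements = (":", " -"), ("a", "1"), ("b", "2"), ("c", "3")
content_title = "Testing: abc"
# print(...) with a single argument works in both Python 2 and Python 3.
print(reduce(lambda s, args: s.replace(*args), replacements, content_title))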

Output:

Testing - 123
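
For comparison, here is the same fold written as a plain loop, as a sketch; `apply_replacements` is a hypothetical helper name. Either way the pairs are applied in sequence, so the output of an earlier pair can be matched by a later one, which is worth keeping in mind when ordering them.

```python
def apply_replacements(text, pairs):
    """Apply each (old, new) pair in order, like the reduce() call."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text

replacements = (":", " -"), ("a", "1"), ("b", "2"), ("c", "3")
print(apply_replacements("Testing: abc", replacements))  # Testing - 123

# Order matters: a later pair sees the output of an earlier one.
print(apply_replacements("ab", [("a", "b"), ("b", "c")]))  # cc
```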
martineau