7

I have a simple function to remove a "word" from some text:

import re

def remove_word_from(word, text):
    if not text or not word: return text
    rec = re.compile(r'(^|\s)(' + word + r')($|\s)', re.IGNORECASE)
    return rec.sub(r'\1\3', text, 1)

The problem, of course, is that if word contains characters such as "(" or ")" things break, and it generally seems unsafe to stick a random word in the middle of a regex.

What's best practice for handling cases like this? Is there a convenient, secure function I can call to escape "word" so it's safe to use?

Parand
  • Note that `r"\n" + "\n"` is not the same as `r"\n" + r"\n"`, though Python lets you slide with \s here. – Fred Nurk Jan 26 '11 at 17:01
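
To illustrate the comment above, here is a small sketch of how a raw and a non-raw string literal differ when concatenated. The variable names are made up for the example.

```python
# In a raw string, a backslash is kept as a literal character.
mixed = r"\n" + "\n"      # backslash, 'n', then one real newline character
both_raw = r"\n" + r"\n"  # backslash, 'n', backslash, 'n'

print(len(mixed))     # 3
print(len(both_raw))  # 4
```

Unlike "\n", "\s" is not a recognized string escape, so Python keeps the backslash in a non-raw string, though recent versions warn about it. Using the r prefix on every regex fragment avoids the question.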

3 Answers

24

You can use re.escape(word) to escape the word.
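
As a minimal sketch, this is the function from the question with re.escape applied to word. The sample inputs are made up for illustration.

```python
import re

def remove_word_from(word, text):
    """Remove the first whitespace-delimited occurrence of word from text."""
    if not text or not word:
        return text
    # re.escape backslash-escapes regex metacharacters, so word matches literally.
    rec = re.compile(r'(^|\s)(' + re.escape(word) + r')($|\s)', re.IGNORECASE)
    return rec.sub(r'\1\3', text, 1)

# Parentheses no longer break the pattern.
print(repr(remove_word_from("(foo)", "keep (foo) this")))  # 'keep  this'
# The dot matches only a literal dot, so "axb" is left alone.
print(repr(remove_word_from("a.b", "axb a.b end")))        # 'axb  end'
```

Note that the surrounding whitespace captured by groups 1 and 3 is put back, so two spaces remain where the word was.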

Vlad H
  • This is a great suggestion -- superior to mine as long as you do not intend for word to have anything like \n and \t in it. – Ishpeck Jan 26 '11 at 16:48
  • 3
    I would also ask him to use `\b` word boundary character. – Senthil Kumaran Jan 26 '11 at 16:55
  • @Ishpeck: Even in that case, this is superior. If you want to parse escape sequences in word, then do that before re.escape. And, of course, if you have those escapes directly in the source (word = "ab\nc"), then word has a literal newline rather than "\n". – Fred Nurk Jan 26 '11 at 16:56
  • 2
    This is why Stackoverflow is awesome. Somehow I'd missed both re.escape and \b, will be using both. – Parand Jan 26 '11 at 16:58
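
For reference, a rough sketch of the \b variant suggested in the comments, combined with re.escape. The name remove_word_b is hypothetical, and this is not a drop-in replacement for the original pattern: \b only matches between a word character and a non-word character, so a word that starts or ends with punctuation behaves differently.

```python
import re

def remove_word_b(word, text):
    """Hypothetical variant: delimit word with \\b instead of whitespace."""
    if not text or not word:
        return text
    rec = re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
    return rec.sub('', text, 1)

# \b matches before the comma, which the (^|\s)...($|\s) pattern would not.
print(repr(remove_word_b("cat", "a cat, a concatenation")))  # 'a , a concatenation'
# No word boundary exists between a space and "(", so nothing is removed.
print(repr(remove_word_b("(foo)", "keep (foo) this")))       # 'keep (foo) this'
```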
0

Unless you're forced to use regexps, couldn't you use the string replace method instead?

text = text.replace(word, '')

This sidesteps the escaping problem, because str.replace treats word as a literal string rather than a pattern.
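
A quick sketch of what str.replace does in practice, using a made-up sample string. It differs from the regex in the question in three ways: it is case-sensitive, it matches inside longer words, and by default it replaces every occurrence.

```python
text = "cat concatenate Cat"

# Every case-sensitive occurrence is removed, including the one inside "concatenate".
removed_all = text.replace("cat", "")
print(repr(removed_all))    # ' conenate Cat'

# The optional third argument limits the number of replacements.
removed_first = text.replace("cat", "", 1)
print(repr(removed_first))  # ' concatenate Cat'
```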

Asclepius
Emmanuel
  • 1
    Though it's often good to consider whether you need regex, using \b (or the not-quite-identical alternative in the question) and re.IGNORECASE makes the regex much easier than correctly doing it otherwise. – Fred Nurk Jan 26 '11 at 16:58
-1

Write a sanitizer function and pass word through that first.

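import re
from functools import reduce  # reduce is not a builtin in Python 3

def sanitize(word):
    # Prefix each regex metacharacter found in word with a backslash.
    def literalize(wd, escapee):
        return wd.replace(escapee, "\\%s" % escapee)
    # Backslash comes first so the backslashes added later are not escaped again.
    return reduce(literalize, "\\()[]*?{}.+|^$", word)

def remove_word_from(word, text):
    if not text or not word: return text
    rec = re.compile(r'(^|\s)(' + sanitize(word) + r')($|\s)', re.IGNORECASE)
    return rec.sub(r'\1\3', text, 1)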
Ishpeck