
I want to replace every occurrence of a stopword in a document using re.sub(). The first thing I tried was the obvious approach:

import re

# Replace with '@' so the replaced words are easy to spot
re.sub('how|to',
       '@',
       'how to do this thing')

This returns '@ @ do this thing', as expected. However, the pattern also matches substrings inside longer words, not only whole words. For example,

re.sub('how|to',
       '@',
       'how to do this thinghowto')

returns '@ @ do this thing@@'. Since I want to match whole words only, I tried

re.sub(r'(^|\s)how($|\s)|(^|\s)to($|\s)',
       '@',
       'how to do this thinghowto')

and it returns '@to do this thinghowto'. It looks like the pattern didn't match ' to '. To check whether this is the case, I tried:

re.sub(r'(^|\s)how($|\s)|(^|\s)to($|\s)',
       '@',
       'how to to how how howtohowto to how')

and got '@to@how@howtohowto@how'. The pattern seems to skip every other match, although everything it does match is a whole-word occurrence. (In ML parlance, it has perfect precision but only roughly 50% recall.) Why is this happening, and what can I do to resolve it?
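
One way to see what is going on is to list the span of each match. The sketch below reuses the pattern and test string from above with re.finditer; the variable names are only for illustration.

```python
import re

pattern = r'(^|\s)how($|\s)|(^|\s)to($|\s)'
text = 'how to to how how howtohowto to how'

# Collect the (start, end) span and the matched text of every match.
matches = list(re.finditer(pattern, text))
spans = [m.span() for m in matches]
matched = [m.group() for m in matches]

for span, chunk in zip(spans, matched):
    # repr() makes the whitespace inside each match visible.
    print(span, repr(chunk))
```

Each match includes the whitespace on either side of the word. Because matches cannot overlap, the space that ends one match is no longer available to start the next one, which is consistent with every other word being skipped.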

Susmit Islam
  • Use whitespace boundaries, `r'(?<!\S)(?:how|to)(?!\S)'`. Lookarounds do not consume text, so consecutive occurrences will all be matched. Check [this thread](https://stackoverflow.com/questions/4295591/why-does-re-sub-in-python-not-work-correctly-on-this-test-case) for an example of replacing consuming patterns with non-consuming ones. – Wiktor Stribiżew Mar 04 '21 at 14:38
  • (BTW, not my downvote, I think this question is good as a "signpost".) – Wiktor Stribiżew Mar 04 '21 at 14:53
  • Yeah I made sure to write this so that if someone were searching like me they found it relatively easily. And the downvote is okay I guess, the question does seem a bit dumb after seeing the explanation. Anyhow, thanks for the help! – Susmit Islam Mar 05 '21 at 11:07
  • BTW, if anyone else sees this in the future, a much simpler solution is using the word boundary special character, `\b`: `re.sub('\\bhow\\b|\\bto\\b', '@', text)` – Susmit Islam Mar 05 '21 at 12:09
  • There is a [canonical dupe target for word boundary](https://stackoverflow.com/questions/15863066/python-regular-expression-match-whole-word), too. – Wiktor Stribiżew Mar 05 '21 at 12:11
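
As a sketch of the two suggestions in the comments above, the following compares whitespace lookarounds with `\b` word boundaries. The stopword list, the second sample string, and the variable names are assumptions made for illustration; `re.escape` is used in case a stopword contains regex metacharacters.

```python
import re

# Hypothetical stopword list, for illustration only.
stopwords = ['how', 'to']
alternation = '|'.join(map(re.escape, stopwords))

# Whitespace boundaries: the lookarounds assert "no non-space character
# here" without consuming anything, so adjacent words can all match.
ws_pattern = re.compile(rf'(?<!\S)(?:{alternation})(?!\S)')

# Word boundaries: \b is also zero-width, but it treats punctuation
# as a boundary as well.
wb_pattern = re.compile(rf'\b(?:{alternation})\b')

text = 'how to to how how howtohowto to how'
ws_result = ws_pattern.sub('@', text)
wb_result = wb_pattern.sub('@', text)
print(ws_result)  # @ @ @ @ @ howtohowto @ @
print(wb_result)  # @ @ @ @ @ howtohowto @ @

# The two patterns differ when punctuation touches a stopword.
punctuated = 'how-to, to.'
ws_punct = ws_pattern.sub('@', punctuated)
wb_punct = wb_pattern.sub('@', punctuated)
print(ws_punct)  # how-to, to.
print(wb_punct)  # @-@, @.
```

Which one fits depends on the data: `\b` replaces stopwords next to punctuation, while the whitespace lookarounds only replace words delimited by whitespace or the ends of the string.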
