I wanted to replace all occurrences of stopwords in a document using re.sub(). The first thing I tried was the obvious approach:
re.sub('how|to', '@', 'how to do this thing')  # replacing with @ so the replaced words are easy to spot
This returns, as expected, '@ @ do this thing'. However, this also matches substrings inside longer words, not just whole words. So for example
re.sub('how|to', '@', 'how to do this thinghowto')
returns '@ @ do this thing@@'. Since I want to match whole words only, I tried
re.sub(r'(^|\s)how($|\s)|(^|\s)to($|\s)', '@', 'how to do this thinghowto')
and it returns '@to do this thinghowto'. It looks like the pattern didn't match ' to '. To check whether this is the case, I tested:
re.sub(r'(^|\s)how($|\s)|(^|\s)to($|\s)', '@', 'how to to how how howtohowto to how')
and got '@to@how@howtohowto@how'. The pattern seemingly skips every other match, although everything it does match is a whole-word occurrence. (In ML parlance, it has perfect precision but only 50% recall.) Why is this happening, and what can I do to resolve it?
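To make the behavior easier to inspect, here is a minimal diagnostic sketch rather than a definitive answer. It uses re.finditer to print the span of each match from the last example, which shows that every match includes its surrounding whitespace. It then tries a possible alternative based on the zero-width \b word-boundary assertion. The variable names and the \b pattern are assumptions added for illustration, and \b treats punctuation as a boundary too, which may or may not be what is wanted here.

```python
import re

text = 'how to to how how howtohowto to how'
pattern = r'(^|\s)how($|\s)|(^|\s)to($|\s)'

# Each match includes the whitespace around the stopword. Matches cannot
# overlap, so the space consumed at the end of one match is no longer
# available as the leading \s of the next stopword.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]
print(spans)
# [(0, 4, 'how '), (6, 10, ' to '), (13, 18, ' how '), (28, 32, ' to ')]

# Assumed alternative: \b is zero-width, so it consumes no characters and
# adjacent stopwords can each match independently.
boundary_pattern = r'\b(?:how|to)\b'
result = re.sub(boundary_pattern, '@', text)
print(result)
# '@ @ @ @ @ howtohowto @ @'
```

With the \b version the spaces between words are preserved, and 'howtohowto' is left untouched because there is no word boundary inside it.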