What does this pattern (?<=\w)\W+(?=\w) mean in a Python regular expression?
#l is a list
print(re.sub("(?<=\w)\W+(?=\w)", " ", l))
Here's a breakdown of the elements:
- \w means a word character (a letter, a digit, or an underscore)
- \W+ is the opposite of \w; with the + it means one or more non-word characters
- (?<=...) is called a "lookbehind assertion"
- (?=...) is a "lookahead assertion"
Neither assertion consumes any characters; each only checks what sits next to the match. So this re.sub statement means "if there are one or more non-word characters with a word character before and after, replace the non-word character(s) with a space".
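To see the lookarounds at work, here's a small sketch. The sample strings are made up for illustration. It shows that only the run of punctuation sandwiched between word characters is matched, while leading and trailing punctuation is left alone, and that an underscore counts as a word character:

```python
import re

pattern = re.compile(r"(?<=\w)\W+(?=\w)")

# Made-up sample text: punctuation at the start, in the middle, and at the end.
text = "!!hello,,, world??"

# Only the run between 'o' and 'w' has a word character on both sides.
matches = pattern.findall(text)
result = pattern.sub(" ", text)

# The underscore is part of \w, so nothing here matches.
underscore = pattern.sub(" ", "snake_case")

print(matches)     # [',,, ']
print(result)      # !!hello world??
print(underscore)  # snake_case
```

The matched text never includes the surrounding letters, which is why they survive the substitution.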
And by the way, the third argument to re.sub must be a string (or bytes-like object); it can't be a list.
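If you really do have a list, one way around this is to apply the substitution to each element in turn. A minimal sketch, assuming every element is a string (the name items and its contents are just placeholders):

```python
import re

pattern = re.compile(r"(?<=\w)\W+(?=\w)")

# Placeholder data; assumes every element is a str.
items = ["This%^&*(string", "John Smith"]

# Passing the whole list raises TypeError.
try:
    pattern.sub(" ", items)
    list_error = None
except TypeError as exc:
    list_error = exc

# Applying the pattern per element works.
cleaned = [pattern.sub(" ", s) for s in items]
print(cleaned)  # ['This string', 'John Smith']
```

Compiling the pattern once is optional here; calling re.sub inside the comprehension would give the same result.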
Just put it into a site like regex101.com and hover the cursor over the parts.
It matches runs of non-word characters that sit between word characters. For example, in the string below it matches everything between the final 'd' of the first 'word' and the initial 'w' of the second 'word':
word^&*((*&^%$%^&*& ^%$£%^&**&^%$£!"£$%^&*()word
Example:
import re

# l is a list, so pass a single element (a string) to re.sub, not the list itself
l = ['John Smith', 'This%^&*(string', 'Never!£$Mind^&*I$?/Solved{}][]It']
print(re.sub(r"(?<=\w)\W+(?=\w)", " ", l[2]))

Output:
Never Mind I Solved It