Given this example:
s = "Hi, domain: (foo.bar.com) bye"
I'd like to create a regex that matches both word and non-word strings separately, i.e.:
re.findall(regex, s)
# Returns: ["Hi", ", ", "domain", ": (", "foo.bar.com", ") ", "bye"]
My approach was to use the word boundary separator \b to catch any string that is bounded by two word-to-non-word switches. From the re module docs:
\b is defined as the boundary between a \w and a \W character (or vice versa)
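To make that definition concrete, here is a small sketch that lists where \b actually falls in the example string. The variable names boundaries and pieces are only illustrative:

```python
import re

s = "Hi, domain: (foo.bar.com) bye"

# \b is zero-width, so each match is just a position in s.
boundaries = [m.start() for m in re.finditer(r'\b', s)]

# Slice s between consecutive boundaries to see the pieces \b delimits.
pieces = [s[a:b] for a, b in zip(boundaries, boundaries[1:])]
print(pieces)
```

Since the dot is a \W character, a boundary falls on each side of it, which is why foo.bar.com ends up split.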
Therefore I tried as a first step:
regex = r'(?:^|\b).*?(?=\b|$)'
re.findall(regex, s)
# Returns: ["Hi", ",", "domain", ": (", "foo", ".", "bar", ".", "com", ") ", "bye"]
The problem is that I don't want the dot (.) character to be a separator too. I'd like the regex to see foo.bar.com as a whole word and not as three words separated by dots.
I tried to find a way to use a negative lookahead on dot but did not manage to make it work.
Is there any way to achieve that?
I don't mind that the dot won't be a separator at all in the regex, it doesn't have to be specific to domain names.
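To illustrate that relaxed requirement, here is a minimal sketch that assumes the dot may simply be treated as a word character everywhere. The character-class alternation below is one possible direction for illustration, not something taken from the linked questions, and it does not use \b at all:

```python
import re

s = "Hi, domain: (foo.bar.com) bye"

# Assumption: a "word" is a run of \w characters or dots,
# and a "non-word" is a run of anything else.
regex = r'[\w.]+|[^\w.]+'

tokens = re.findall(regex, s)
print(tokens)
```

Under this assumption a trailing dot (as in "bye.") would also be glued to the preceding word, which may or may not be acceptable.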
I looked at Regex word boundary alternative, Capture using word boundaries without stopping at "dot" and/or other characters and Regex word boundary excluding the hyphen, but they do not fit my case, as I cannot use the space as a separator condition.
Exclude some characters from word boundary is the only one that got me close, but I didn't manage to get it working.