Struggling at creating correct regex

Question

I'm trying to match every 'word' (all lowercase or all capitals):
(word|WORD)
With no letters or digits directly before or after it:
(?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])
In case 'word' is at the start or end of the string:
^|(?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])|$

It simply doesn't work. Any suggestions?

Maybe `(?i)\bword\b`? Not sure if Python supports that. (Or should only all caps or all lowercase be matched?) — user3783243, Sep 24 '20 at 17:42
Python supports the case-insensitive flag, but in case OP wants fixed case only, `\b(word|WORD)\b` with `re.findall` should do the trick. — Axe319, Sep 24 '20 at 18:28
`word` is just an acronym, right? There really is no such thing as a _word_ boundary in regex, since no language is specified in regex constructs. What you're looking for is a boundary as you define it. Let's say you have a letter `W` and you don't want letters or numbers surrounding it: `(?<![a-zA-Z0-9])W(?![a-zA-Z0-9])`. Presto, your own boundary definition. Let's say you want a `Q` surrounding it, or the end/beginning of the string: `(?<![^Q])W(?![^Q])`, and again a new boundary definition. The idea of a word in regex is meaningless! — , Sep 24 '20 at 21:25

Jan · Accepted Answer · 2020-09-24T18:21:41.547

You might be looking for

import re

text = "123 Lorem ipsum dolor sit amet, word WORD WoRd consetetur sadipscing elitr, sed diam 123"

pattern = re.compile(r'\bword\b', re.IGNORECASE)

for word in pattern.finditer(text):
    print(word.group(0))

Which would yield

word
WORD
WoRd

\b is the short form for

(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))

Which reads

(?=\w)(?<!\w) # positive lookahead making sure there's a word character coming
              # negative lookbehind making sure there's no word character preceding
|             # or
(?<=\w)(?!\w) # the other way round

So, yes

(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))word(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))

would yield the same matches as above but seems a bit unreadable.
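
As a quick check of that equivalence, the sketch below runs both forms over the sample text and compares the results. The names `boundary`, `short_form` and `long_form` are only illustrative, and "swordfish" has been appended to the sample text as an assumption, to show that an embedded "word" is rejected by both.

```python
import re

# Sample text from above, with "swordfish" appended as an extra negative case
text = ("123 Lorem ipsum dolor sit amet, word WORD WoRd consetetur "
        "sadipscing elitr, sed diam 123 swordfish")

# Hand-written equivalent of \b, as spelled out above
boundary = r'(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))'

short_form = re.compile(r'\bword\b', re.IGNORECASE)
long_form = re.compile(boundary + 'word' + boundary, re.IGNORECASE)

short_matches = short_form.findall(text)
long_matches = long_form.findall(text)

print(short_matches)                  # matches found with \b
print(long_matches)                   # matches found with the expanded form
print(short_matches == long_matches)  # True if both forms agree
```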

score 0 · Answer 2 · answered Sep 24 '20 at 18:08

(?<![^a-zA-Z0-9]) is a double negative. You're saying it should NOT match if the character before the main expression is NOT in [a-zA-Z0-9]; that is, it can only match if that character is in [a-zA-Z0-9] (or if there is no preceding character at all). Just remove the ^: (?<![a-zA-Z0-9]).

Your use of the string boundaries ^ and $ is confusing here, but you shouldn't need them if you're using negative look-behind and negative look-ahead.

So, switch to (?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9]).
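
To see the difference, here is a small sketch that runs the original and the corrected pattern against a few made-up strings (the sample strings and variable names are assumptions chosen for illustration):

```python
import re

# Hypothetical test strings: standalone, embedded, and the whole string
samples = ["a word here", "xwordx", "word"]

# Original lookarounds: "not preceded/followed by a NON-alphanumeric"
double_negative = re.compile(r'(?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])')

# Corrected lookarounds: "not preceded/followed by an alphanumeric"
corrected = re.compile(r'(?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9])')

# Map each sample to (original pattern matched?, corrected pattern matched?)
results = {
    s: (bool(double_negative.search(s)), bool(corrected.search(s)))
    for s in samples
}

for s, (original_hit, corrected_hit) in results.items():
    print(repr(s), "original:", original_hit, "corrected:", corrected_hit)
```

The original pattern rejects the standalone word in "a word here" and accepts the embedded one in "xwordx", which is the opposite of what was intended. Both patterns accept "word" on its own, because a negative lookaround succeeds when there is no neighbouring character to test, so no ^ or $ is needed.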

That said, @user3783243's comment about \b is a better option. \b is a 'word boundary', which represents exactly what you are trying to capture. Python does support it: official docs. Related: Regular expression optional match start/end of line

So you should actually just use \b(word|WORD)\b.
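
One caveat worth checking against your data: \b is defined in terms of \w, which also covers the underscore (and, for str patterns in Python 3, Unicode letters and digits), so it is not exactly the same boundary as [a-zA-Z0-9]. A minimal sketch with a made-up sample string:

```python
import re

# Hypothetical sample covering mixed case, underscore, digits and punctuation
text = "word WORD Word word_1 sword word2 (word)"

# \b treats "_" as a word character, so "word_1" is not matched
with_b = re.findall(r'\b(word|WORD)\b', text)

# The explicit character class treats "_" as a separator, so "word_1" is matched
with_class = re.findall(r'(?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9])', text)

print(with_b)      # matches using \b
print(with_class)  # matches using the explicit lookarounds
```

Neither pattern matches "Word", "sword" or "word2"; they only disagree on "word_1". If underscores should count as separators, the explicit lookarounds may be the better fit.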
