  1. I'm trying to match every 'word' (all lowercase or all capitals):
    (word|WORD)
  2. No letters or digits immediately before or after it:
    (?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])
  3. In case 'word' is at the start or end of the string:
    ^|(?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])|$

It simply doesn't work. Any suggestions?

Wiktor Stribiżew
meadow
  • Maybe `(?i)\bword\b`? Not sure if Python supports that. (Or should only all caps or all lowercase be supported?) – user3783243 Sep 24 '20 at 17:42
  • Python supports the case insensitive flag, but in case OP wants fixed case only `\b(word|WORD)\b` with `re.findall` should do the trick. – Axe319 Sep 24 '20 at 18:28
  • `word` is just an acronym, right? There really is no such thing as a _word_ boundary in regex, since regex constructs don't specify a language. What you're looking for is a boundary as you define it. Let's say you have a letter `W` and you don't want letters or numbers surrounding it: `(?<![a-zA-Z0-9])W(?![a-zA-Z0-9])`. Presto, your own boundary definition. Let's say you want a `Q` surrounding it, or the beginning/end of the string: `(?<![^Q])W(?![^Q])`, and again you have a new boundary definition. The idea of a word in regex is meaningless! –  Sep 24 '20 at 21:25
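
As a rough illustration of the custom-boundary idea in the last comment, the sketch below runs the `Q` pattern against a few made-up sample strings (the samples and variable names are assumptions for illustration, not part of the original comment). A negative lookaround around a negated class succeeds either when the neighbour is a `Q` or when there is no neighbour at all.

```python
import re

# Pattern taken from the comment: W must be surrounded by Q
# or sit at the beginning/end of the string.
pattern = re.compile(r'(?<![^Q])W(?![^Q])')

# Hypothetical sample strings, chosen only to exercise each case.
samples = ["W", "QWQ", "QW", "aWb", " W "]

results = {s: bool(pattern.search(s)) for s in samples}
for sample, matched in results.items():
    print(repr(sample), matched)
```

Only the strings where `W` touches something other than `Q` (here `aWb` and ` W `) should fail to match.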

2 Answers


You might be looking for

import re

text = "123 Lorem ipsum dolor sit amet, word WORD WoRd consetetur sadipscing elitr, sed diam 123"

pattern = re.compile(r'\bword\b', re.IGNORECASE)

for word in pattern.finditer(text):
    print(word.group(0))

Which would yield

word
WORD
WoRd

\b is the short form for

(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))

Which reads

(?=\w)(?<!\w) # positive lookahead making sure there's a word character coming
              # negative lookbehind making sure there's no word character preceding
|             # or
(?<=\w)(?!\w) # the other way round

So, yes

(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))word(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))

would yield the same matches as above but seems a bit unreadable.
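
To check that claim, here is a small sketch that compiles both forms and compares their matches. The sample text and the variable names (`short`, `expanded`, `boundary`) are made up for illustration.

```python
import re

# Sample text with a few near-misses ("swordfish", "wordy") that must not match.
text = "123 Lorem ipsum, word WORD WoRd swordfish wordy 123"

# The short form using \b.
short = re.compile(r'\bword\b', re.IGNORECASE)

# The same boundary spelled out with lookarounds.
boundary = r'(?:(?=\w)(?<!\w)|(?<=\w)(?!\w))'
expanded = re.compile(boundary + 'word' + boundary, re.IGNORECASE)

short_matches = [m.group(0) for m in short.finditer(text)]
expanded_matches = [m.group(0) for m in expanded.finditer(text)]

print(short_matches)     # ['word', 'WORD', 'WoRd']
print(expanded_matches)  # ['word', 'WORD', 'WoRd']
```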

Jan

(?<![^a-zA-Z0-9]) is a double negative. You're saying it should NOT match if the character before the main expression is NOT in [a-zA-Z0-9]; that is, it can only match if that character is in [a-zA-Z0-9] (or if there is no preceding character at all, at the start of the string). Just remove the ^: (?<![a-zA-Z0-9]).

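Your use of the string boundaries ^ and $ is confusing here: the alternation splits your third pattern into three separate alternatives (^, the middle part, and $) instead of attaching the anchors to the word. You shouldn't need them anyway if you're using negative look-behind and negative look-ahead, because those also succeed at the start and end of the string.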

So, switch to (?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9]).
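
A quick way to see the difference is to run both patterns on the same input. This is only a sketch; the sample text and variable names are invented for illustration. The original pattern matches only where `word` is glued to letters or digits on both sides, while the corrected one matches the standalone occurrences.

```python
import re

# "xwordx" is the only place where word has alphanumerics on both sides.
text = "word, WORD; sword wordy xwordx"

# Pattern from the question (double negative).
original = re.compile(r'(?<![^a-zA-Z0-9])(word|WORD)(?![^a-zA-Z0-9])')
# Corrected pattern without the ^ inside the character classes.
fixed = re.compile(r'(?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9])')

# Record the start offset of each match so the two results are easy to compare.
original_hits = [m.start() for m in original.finditer(text)]
fixed_hits = [m.start() for m in fixed.finditer(text)]

print(original_hits)  # [25]   -> the "word" inside "xwordx"
print(fixed_hits)     # [0, 6] -> the standalone "word" and "WORD"
```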

That said, @user3783243's comment about \b is a better option. \b is a 'word boundary', which represents exactly what you are trying to capture. Python does support it: official docs. Related: Regular expression optional match start/end of line

So you should actually just use \b(word|WORD)\b.
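
For comparison, a minimal sketch of the options discussed in this thread, using `re.findall` on an invented sample string. One difference worth knowing: `\b` is defined in terms of `\w`, which includes the underscore, so `word_1` does not match with `\b` but does match with the explicit `[a-zA-Z0-9]` lookarounds.

```python
import re

# Invented sample: mixed case, an embedded occurrence, and an underscore suffix.
text = "word WORD WoRd sword word_1"

# Fixed case only, using \b.
fixed_case = re.findall(r'\b(word|WORD)\b', text)

# Any case, using the IGNORECASE flag.
any_case = re.findall(r'\bword\b', text, flags=re.IGNORECASE)

# Explicit alphanumeric lookarounds: the underscore does not count as a neighbour.
alnum_only = re.findall(r'(?<![a-zA-Z0-9])(word|WORD)(?![a-zA-Z0-9])', text)

print(fixed_case)  # ['word', 'WORD']
print(any_case)    # ['word', 'WORD', 'WoRd']
print(alnum_only)  # ['word', 'WORD', 'word']
```

Which variant fits depends on whether mixed case like `WoRd` should count and whether an underscore should act as a boundary.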

Elliot Way