I am using the regex library to find words that appear between specific other words. For example, I want to match "world" if and only if a greeting precedes it and punctuation follows it. To avoid matching inside longer words (a greeting that is only the suffix of a word, or punctuation followed directly by a letter), I added the extra condition [^a-zA-Z] on both sides. However, once I add it, regex no longer matches the word:

>>> import regex

>>> pat = regex.compile(r"(?<=[^a-zA-Z](hello|hi)\s+)world(?=\s*[!?.][^a-zA-Z])")

>>> list(pat.finditer("hello world!"))
[]

>>> pat = regex.compile(r"(?<=\b(hello|hi)\s+)world(?=\s*[!?.]\b)")

>>> list(pat.finditer("hello world!"))
[]

>>> pat = regex.compile(r"(?<=(hello|hi)\s+)world(?=\s*[!?.])")

>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]

How can this be explained? And how can I make sure that only whole words match inside the lookahead and lookbehind?

Green绿色
  • Does it match `" hello world!"` (with a preceding space)? `[^a-zA-Z]` has width 1, so I think the string can't start with `hello` or `hi`. That said, I've never worked with the `regex` module, only `re`, so I can't speak for that package. – Michael Delgado Aug 14 '22 at 04:09
  • It indeed does match `" hello world! "`. Thanks for your help! – Green绿色 Aug 14 '22 at 04:29
  • Try boundaries instead `pat = regex.compile("(?<=\\b(hello|hi)\\s+)world(?=\\b\\s*[!?.])")` – Daniel Aug 14 '22 at 05:24
  • @Daniel That would work for the look-behind, too. But the look-ahead doesn't work as intended, because `"hello world!x"` shouldn't match. – Green绿色 Aug 14 '22 at 08:27
  • If you want to debug a regex, [regex101](https://regex101.com/) is one of the best tools currently available, though a web search will turn up more. See also "[How do you debug a regex? \[closed\]](/q/2348694/90527)", "[How much research effort is expected of Stack Overflow users?](//meta.stackoverflow.com/q/261592/90527)". – outis Aug 20 '22 at 02:52

2 Answers

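The reason is that (?<=...) and (?=...) are assertions: everything you specify inside them must actually be present to the left and to the right of the match. [^a-zA-Z] has to match one character, so it fails at the start or end of the string, where there is no character at all.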

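Note that [!?.]\b only matches when a word character directly follows the punctuation. The punctuation characters are themselves non-word characters, so there is no word boundary after them at the end of the string or before whitespace.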
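The word-boundary behaviour can be checked in isolation. The following is a minimal sketch using the standard-library re module rather than regex (an assumption made only to keep the snippet dependency-free; \b should behave the same way in both for these ASCII inputs). The variable names are illustrative.

```python
import re

# "!" is a non-word character, so "[!?.]\b" needs a word character
# immediately after the punctuation for the boundary to exist.
punct_then_boundary = re.compile(r"[!?.]\b")

at_end = punct_then_boundary.search("hello world!")          # nothing follows "!"
before_space = punct_then_boundary.search("hello world! ")   # a space follows "!"
before_letter = punct_then_boundary.search("hello world!x")  # a letter follows "!"

print(at_end, before_space, before_letter)
```

Only the last search finds a match, which is the opposite of what the lookahead in the question was meant to allow.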

You could write the pattern as:

(?<=\b(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))

Explanation

  • (?<= Positive lookbehind, assert that to the left is
    • \b(?:hello|hi)\s+ Match either the word hello or hi and 1+ whitespace chars
  • ) Close the lookbehind
  • world Match literally
  • (?= Positive lookahead, assert that to the right is
    • \s*[!?.] Match optional whitespace chars and one of ! ? .
    • (?!\S) Assert a whitespace boundary to the right
  • ) Close the lookahead

Or asserting a whitespace boundary to the left instead of the word boundary:

(?<=(?<!\S)(?:hello|hi)\s+)world(?=\s*[!?.](?!\S))
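If only the standard-library re module is available, the variable-width lookbehind above is not accepted, because re requires fixed-width lookbehinds. A possible workaround, sketched here as a different technique rather than the pattern from this answer, is to match the greeting normally and capture "world" in a group. The helper name find_world_spans is hypothetical.

```python
import re

# Assumption: stdlib `re` only. The greeting is consumed instead of being
# asserted in a lookbehind, and "world" is captured in group 1.
# (?<!\S) and (?!\S) are zero-width whitespace boundaries, which `re` accepts.
pat = re.compile(r"(?<!\S)(?:hello|hi)\s+(world)(?=\s*[!?.](?!\S))")


def find_world_spans(text):
    """Return the (start, end) span of every 'world' that qualifies."""
    return [m.span(1) for m in pat.finditer(text)]


print(find_world_spans("hello world!"))
print(find_world_spans("hello world!x"))
print(find_world_spans("xhello world!"))
```

Because the greeting is part of the overall match here, the full match span differs from the lookbehind version; only group 1 corresponds to "world".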

Regex demo

The fourth bird

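As @Michael correctly pointed out, the width was the problem: [^a-zA-Z] always consumes one character, so it cannot match at the start or end of the string. Allowing ^ and $ as alternatives does the trick: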

>>> import regex

>>> pat = regex.compile(r"(?<=([^a-zA-Z]|^)(hello|hi)\s+)world(?=\s*[!?.]($|[^a-zA-Z]))")

>>> list(pat.finditer("hello world!"))
[<regex.Match object; span=(6, 11), match='world'>]

>>> list(pat.finditer("hello world!x"))
[]

>>> list(pat.finditer("xhello world!"))
[]
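
The width issue can also be avoided with negative lookarounds, which assert that no letter is adjacent without consuming anything. The sketch below uses the standard-library re module (an assumption, to stay dependency-free) and only demonstrates the left-hand side; the names are illustrative. The same idea should carry over to the regex pattern above by writing (?<![a-zA-Z]) in place of ([^a-zA-Z]|^) and (?![a-zA-Z]) in place of ($|[^a-zA-Z]), though that combined pattern is not tested here.

```python
import re

# "[^a-zA-Z]" must consume one character, so it cannot match at position 0.
letter_class_before = re.compile(r"[^a-zA-Z]hello")
# "(?<![a-zA-Z])" is zero-width: it only asserts that no letter precedes.
no_letter_before = re.compile(r"(?<![a-zA-Z])hello")

results = {
    text: (
        bool(letter_class_before.search(text)),
        bool(no_letter_before.search(text)),
    )
    for text in ["hello", " hello", "xhello"]
}

for text, outcome in results.items():
    print(repr(text), outcome)
```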
Green绿色