I have the following string:

the quick brown fox abc(1)(x)

with the following regex:

(?i)(\s{1})(abc\(1\)\([x|y]\))

and the output is

abc(1)(x)

which is expected. However, I can't seem to:

  1. use \W, \w, \d, \D, etc. to extract more than one space
  2. combine a quantifier to add more spaces.

I would like the following output:

the quick brown fox abc(1)(x)

From the primary lookup "abc(1)(x)", I would like up to 5 words on either side of the lookup. My assumption is that spaces demarcate words.

Edit 1:

The 5 words on either side would be unknown for future examples. The string may be:

cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog.

In this case, the desired output would be:

with a black hat is abc(1)(x) the quick brown fox jumps

Edit 2:

Edited the expected output in the first example and added "up to" 5 words.

qbbq

2 Answers

(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}abc\(1\)\([xy]\)(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}

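Note that I've changed \w+ to [0-9A-Za-z_]+ and \W+ to [^0-9A-Za-z_]+ because, depending on your locale and Unicode settings, \w and \W might not behave the way you expect in Python. For example, in Python 3, \w also matches non-ASCII letters in str patterns unless re.ASCII is set.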
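
To illustrate the point about Unicode settings, here is a small sketch, assuming Python 3 and str patterns. The sample word is made up for this example and is not from the question.

```python
import re

# Hypothetical input, chosen only because it contains a non-ASCII letter (i with diaeresis).
sample = "na\u00efve_1"

# Default for str patterns in Python 3: \w also matches non-ASCII letters.
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# The explicit character class gives the ASCII-only result without any flag.
explicit_words = re.findall(r"[0-9A-Za-z_]+", sample)

print(unicode_words)   # one word, including the non-ASCII letter
print(ascii_words)     # ['na', 've_1']
print(explicit_words)  # ['na', 've_1']
```

Whether the Unicode-aware or the ASCII-only behavior is preferable depends on what you want to count as a "word".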

Also note that I don't look specifically for spaces, just "non-word characters". This probably handles edge cases such as quote characters a little better. Regardless, this should get you most of the way there.
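
As a rough sketch of how the pattern above could be used in Python, the snippet below runs it against the two strings from the question. The helper name `context` is made up for illustration, and the re.IGNORECASE flag is an assumption carried over from the (?i) in the question's regex.

```python
import re

# Pattern from this answer: up to 5 words before and after the target.
PATTERN = re.compile(
    r"(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}"
    r"abc\(1\)\([xy]\)"
    r"(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}",
    re.IGNORECASE,  # assumption: mirrors the (?i) used in the question
)

def context(text):
    """Return the target with up to 5 words on either side, or None if absent."""
    match = PATTERN.search(text)
    return match.group(0) if match else None

short = "the quick brown fox abc(1)(x)"
long_ = "cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog."

print(context(short))  # the quick brown fox abc(1)(x)
print(context(long_))  # with a black hat is abc(1)(x) the quick brown fox jumps
```

Because {0,5} is greedy, the match takes as many surrounding words as are available, up to five on each side.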

BTW: You're calling this "lookaround", but it really has nothing to do with lookaround, the regex feature.

Dean Taylor

If I understand your requirements correctly, you want to do something like this:

(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}

Breakdown:

(?:               # Start of a non-capturing group.
    \w+           # Any word character repeated one or more times (basically, a word).
    [ ]           # Matches a space character literally.
)                 # End of the non-capturing group.
{0,5}             # Match the previous group between 0 and 5 times.
(                 # Start of the first capturing group.
    abc\(1\)      # Matches "abc(1)" literally.
    \([xy]\)      # Matches "(x)" or "(y)". You don't need "|" inside a character class.
)                 # End of the capturing group.
(?:[ ]\w+){0,5}   # Same as the non-capturing group above but the space is before the word.

Notes:

  • To make the pattern case insensitive, you may start it with (?i) as you're doing already or use the re.IGNORECASE flag.
  • If you want to support words not separated by a space, you may replace [ ] with either \W+ (one or more non-word characters) or a character class that includes all the punctuation characters you want to support (e.g., [.,;?! ]).
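
The two notes above can be sketched in Python roughly as follows. The names `SPACE_SEPARATED` and `ANY_SEPARATOR` and the punctuated sample string are made up for illustration; only the patterns themselves come from this answer.

```python
import re

# Pattern from this answer: words separated by exactly one space.
SPACE_SEPARATED = re.compile(
    r"(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}",
    re.IGNORECASE,  # first note: flag instead of a leading (?i)
)

# Second note: allow any run of non-word characters between words.
ANY_SEPARATOR = re.compile(
    r"(?:\w+\W+){0,5}(abc\(1\)\([xy]\))(?:\W+\w+){0,5}",
    re.IGNORECASE,
)

text = "cat with a black hat is ABC(1)(X) the quick brown fox jumps over the lazy dog."
match = SPACE_SEPARATED.search(text)
print(match.group(0))  # with a black hat is ABC(1)(X) the quick brown fox jumps
print(match.group(1))  # ABC(1)(X)

# Hypothetical input with punctuation between the words.
punctuated = "one, two; three abc(1)(y), four... five"
print(SPACE_SEPARATED.search(punctuated).group(0))  # three abc(1)(y)
print(ANY_SEPARATOR.search(punctuated).group(0))    # one, two; three abc(1)(y), four... five
```

With the space-only version, punctuation next to a word ends the context early, which is why the \W+ variant picks up more of the punctuated string.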