RegEx for matching strings with spaces and words

Question

I have the following string:

the quick brown fox abc(1)(x)

with the following regex:

(?i)(\s{1})(abc\(1\)\([x|y]\))

and the output is

abc(1)(x)

which is expected. However, I can't seem to:

use \W, \w, \d, \D, etc. to extract more than one space
combine the quantifier to add more spaces

I would like the following output:

the quick brown fox abc(1)(x)

From the primary lookup "abc(1)(x)", I would like up to 5 words on either side of the lookup. My assumption is that spaces demarcate words.

Edit 1:

The 5 words on either side will be unknown in future examples. The string may be:

cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog.

In this case, the desired output would be:

with a black hat is abc(1)(x) the quick brown fox jumps

Edit 2:

Edited the expected output in the first example and added "up to" 5 words.

_"I would like 5 words on either side"_ Where are those five words in your desired output? — 41686d6564 stands w. Palestine, Jul 16 '19 at 00:26
the expected output for this specific example is clear, but if you gave another sentence I wouldn't know what would you want to extract. Please clarify what you're trying to do (focus on the _what_ not on the _how_) — Nir Alfasi, Jul 16 '19 at 00:27
Also what regex flavor (or programming language) are you using? — 41686d6564 stands w. Palestine, Jul 16 '19 at 00:31
thanks - I have placed an edit in my original question to address these questions — qbbq, Jul 16 '19 at 00:52
@qbbq So, do you mean that you want _up to_ five words on each side? It's still not clear to me why the expected output of the first example starts with "quick" and not "the". Can you please clarify? — 41686d6564 stands w. Palestine, Jul 16 '19 at 00:58
@AhmedAbdelhameed yes up to five words - its a typo from my side, I will amend this in the original question — qbbq, Jul 16 '19 at 01:04

score 1 · Accepted Answer · answered Jul 16 '19 at 01:06

(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}abc\(1\)\([xy]\)(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}

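Note that I've changed \w+ to [0-9A-Za-z_]+ and \W+ to [^0-9A-Za-z_]+ because, depending on your locale / Unicode settings, \w and \W might not act the way you expect in Python. In Python 3, for example, \w in a str pattern also matches non-ASCII letters and digits unless the re.ASCII flag is set.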
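To illustrate that difference, here is a small sketch that assumes Python 3 and str (not bytes) patterns. The sample text is made up for the illustration and is not from the question.

```python
import re

# Hypothetical sample text: "\u00ef" is i-diaeresis, "\u00e9" is e-acute.
text = "na\u00efve caf\u00e9"

# Python 3, str pattern: \w is Unicode-aware by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# The explicit class used in this answer does not depend on any flag.
explicit_words = re.findall(r"[0-9A-Za-z_]+", text)

print(unicode_words)
print(ascii_words)
print(explicit_words)
```

With the default settings the accented letters count as word characters. With re.ASCII or the explicit class they act as separators, so the words are split around them.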

Also note that I don't look specifically for spaces, just "non-word characters". This probably handles edge cases such as quote characters a little better. Regardless, this should get you most of the way there.

BTW: You're calling this "lookaround", but it really has nothing to do with the regex feature called "lookaround".
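As a rough sketch of how the pattern from this answer could be applied in Python to the two strings from the question (the helper name extract_context is hypothetical, chosen only for this example):

```python
import re

# Pattern copied from this answer, split over three lines for readability.
PATTERN = re.compile(
    r"(?:[0-9A-Za-z_]+[^0-9A-Za-z_]+){0,5}"   # up to 5 words before
    r"abc\(1\)\([xy]\)"                       # the target itself
    r"(?:[^0-9A-Za-z_]+[0-9A-Za-z_]+){0,5}"   # up to 5 words after
)


def extract_context(text):
    """Return the target plus up to five words on each side, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


first = extract_context("the quick brown fox abc(1)(x)")
second = extract_context(
    "cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog."
)
print(first)
print(second)
```

re.search is used rather than re.match so that the match may start in the middle of the string, which is what drops the leading "cat" in the second example.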

41686d6564 stands w. Palestine · Answer 2 · 2019-07-16T01:14:38.940

If I understand your requirements correctly, you want to do something like this:

(?:\w+[ ]){0,5}(abc\(1\)\([xy]\))(?:[ ]\w+){0,5}


Breakdown:

(?:               # Start of a non-capturing group.
    \w+           # Any word character repeated one or more times (basically, a word).
    [ ]           # Matches a space character literally.
)                 # End of the non-capturing group.
{0,5}             # Match the previous group between 0 and 5 times.
(                 # Start of the first capturing group.
    abc\(1\)      # Matches "abc(1)" literally.
    \([xy]\)      # Matches "(x)" or "(y)". You don't need "|" inside a character class.
)                 # End of the capturing group.
(?:[ ]\w+){0,5}   # Same as the non-capturing group above but the space is before the word.
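The commented layout above can be kept in real code. A minimal sketch, assuming Python's re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments inside the pattern (the name CONTEXT_RE is made up for this example):

```python
import re

CONTEXT_RE = re.compile(
    r"""
    (?:\w+[ ]){0,5}     # up to five words, each followed by a space
    (                   # capturing group 1: the target itself
        abc\(1\)        #   the literal text abc(1)
        \([xy]\)        #   followed by (x) or (y)
    )
    (?:[ ]\w+){0,5}     # up to five words, each preceded by a space
    """,
    re.VERBOSE,
)

match = CONTEXT_RE.search(
    "cat with a black hat is abc(1)(x) the quick brown fox jumps over the lazy dog."
)
whole = match.group(0)   # the target plus the surrounding words
target = match.group(1)  # the target only
print(whole)
print(target)
```

The space is written as [ ] on purpose: in verbose mode a bare space would be ignored, while a space inside a character class is kept.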

Notes:

To make the pattern case insensitive, you may start it with (?i) as you're doing already or use the re.IGNORECASE flag.
If you want to support words not separated by a space, you may replace [ ] with either \W+ (which means non-word characters) or with a character class which includes all the punctuation characters that you want to support (e.g., [.,;?! ]).
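Both notes can be tried out with a small sketch. The function build_pattern and its parameters are hypothetical names introduced for this example, and the sample sentence is made up:

```python
import re

TARGET = r"abc\(1\)\([xy]\)"


def build_pattern(sep=r"[ ]", max_words=5):
    """Build the context pattern with a configurable word separator.

    sep is a regex fragment: "[ ]" for a single space, r"\W+" for any run
    of non-word characters. Matching is case insensitive.
    """
    return re.compile(
        rf"(?:\w+{sep}){{0,{max_words}}}({TARGET})(?:{sep}\w+){{0,{max_words}}}",
        re.IGNORECASE,
    )


text = "Hello, world: ABC(1)(Y); done."

space_only = build_pattern()        # words must be separated by one space
loose = build_pattern(sep=r"\W+")   # punctuation also separates words

space_match = space_only.search(text).group(0)
loose_match = loose.search(text).group(0)
print(space_match)
print(loose_match)
```

With the single-space separator only the target itself matches here, because the neighbouring words are followed or preceded by punctuation. With \W+ the surrounding words are picked up as well.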
