Regex: make negative lookarounds "greedier"

Question

I have the following Python regex pattern from an earlier question:

import re

regex_pat = re.compile(r'''
            (
            [a-zA-Z\*]*
            \*
            [a-zA-Z\*]*
            )+
          ''', re.VERBOSE)

Now I want the match to fail if any digit is mixed in with the "word", especially at the start or the end.

text = '''
    (A) Match these:
    *** star* st**r

    (B) Not these:
    800*m *4,500 

    (C) And not these:
    800**m **4,000
    '''

By placing a negative lookbehind and a negative lookahead at various positions in the pattern, I can get rid of the (B) matches, but not the (C) matches. For example:

regex_pat = re.compile(r'''
            (
            [a-zA-Z\*]*
            (?<!\d)     # the asterisk must not come right after a digit
            \*
            (?!\d)      # ...nor right before one
            [a-zA-Z\*]*
            )+
          ''', re.VERBOSE)
regex_pat.findall(text)
# ['***', 'star*', 'st**r', '**m', '**'] The last two matches are no good.

Apparently, when the regex engine hits a negative lookaround that fails, it backtracks to see whether it can still get a match some other way. How can I make the negative lookarounds greedier or more destructive, so to speak?
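One way to check that guess is to print where each match starts and ends. The sketch below uses `sample`, a shortened copy of the (C) line above (an assumption made for brevity, not part of the original post), and suggests that each lookaround only tests a single position, so the engine remains free to start a match in the middle of a token or to stop it early.

```python
import re

# The pattern from the question, written as a raw string.
regex_pat = re.compile(r'''
    (
    [a-zA-Z*]*
    (?<!\d)
    \*
    (?!\d)
    [a-zA-Z*]*
    )+
''', re.VERBOSE)

# Shortened copy of the (C) line from the question (illustrative only).
sample = '800**m **4,000'

# Record (start, end, matched text) for every match.
spans = [(m.start(), m.end(), m.group(0)) for m in regex_pat.finditer(sample)]
print(spans)
# [(3, 6, '**m'), (7, 9, '**')]
```

The first match begins at index 3, after the digits, and the second one ends at index 9, just before the `4`. Nothing in the pattern ties a match to the edges of the whitespace-delimited token, which is why digit checks placed around a single `\*` are not enough.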

Try `(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*`, see https://regex101.com/r/Gsq87y/1 — Wiktor Stribiżew, Mar 28 '19 at 21:04
@WiktorStribiżew Lookarounds need to be fixed-width in Python, so I doubt `(?!\*+\d)` will work, but your answer inspired me to come up with something that seems to work, almost miraculously. Thank you. – bongbang, Mar 28 '19 at 22:09
`(?!\*+\d)` works in Python `re`. It is a lookahead, not a lookbehind, so its pattern does not have to be fixed-width. See my answer below. – Wiktor Stribiżew, Mar 28 '19 at 22:09
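The distinction in the last comment can be checked directly. This is a minimal sketch with made-up test strings: the standard `re` module compiles a variable-width lookahead without complaint, but rejects a variable-width lookbehind at compile time.

```python
import re

# A variable-width lookahead compiles and works in the standard `re` module.
lookahead = re.compile(r'(?!\*+\d)\*+')
print(lookahead.findall('** **4'))  # only the run not followed by a digit

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<!\d+)\*')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print('re.error:', exc)
```

So the fixed-width restriction applies only to `(?<=...)` and `(?<!...)`, not to `(?=...)` and `(?!...)`.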

score 1 · Answer 1 · answered Mar 28 '19 at 22:09

You may use

(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*

See the regex demo.

Details

(?<!\S) - the current position must be at the start of the string or right after whitespace
(?!\*+\d) - fail the match if the current position is followed by 1 or more asterisks and then a digit
[a-zA-Z]* - 0+ letters
\* - asterisk
[a-zA-Z*]* - 0+ letters or asterisks.

The point is to start matching only at the start of the string or right after whitespace, check that no digit follows the leading run of one or more asterisks, and then match the pattern you need.
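To see what each lookaround contributes, one can drop them one at a time. The sketch below is illustrative only: `sample` flattens the question's test text onto one line, and the names `full`, `no_boundary` and `no_digit_check` are made up for this comparison.

```python
import re

# The question's test tokens on a single line (illustrative).
sample = '*** star* st**r 800*m *4,500 800**m **4,000'

patterns = {
    'full':           r'(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*',
    'no_boundary':    r'(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*',   # (?<!\S) removed
    'no_digit_check': r'(?<!\S)[a-zA-Z]*\*[a-zA-Z*]*',     # (?!\*+\d) removed
}

# Run every variant against the same input.
results = {name: re.findall(pat, sample) for name, pat in patterns.items()}
for name, found in results.items():
    print(f'{name:15} {found}')
# full            ['***', 'star*', 'st**r']
# no_boundary     ['***', 'star*', 'st**r', '*m', '**m']
# no_digit_check  ['***', 'star*', 'st**r', '*', '**']
```

Without `(?<!\S)` the engine can begin a match in the middle of `800*m` and `800**m`; without `(?!\*+\d)` it picks up the leading asterisks of `*4,500` and `**4,000`.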

See the Python demo:

import re
text = '''
    (A) Match these:
    *** star* st**r

    (B) Not these:
    800*m *4,500 

    (C) And not these:
    800**m **4,000
    '''
print(re.findall(r'(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*', text))
# => ['***', 'star*', 'st**r']

score -1 · Answer 2 · bongbang · 2019-03-28T22:47:49.133

This answer to my own question is inspired by Wiktor Stribiżew's comment. It seems to work. I'm posting it here so that a sharper eye may be able to tell me about any flaws in it.

regex_pat = re.compile(r'''
            (?<!\S)
            [a-zA-Z*]*
            \*
            [a-zA-Z*]*
            (?!\S)
          ''', re.VERBOSE)

The logic, as I understand it, is that the lookbehind and the lookahead force any match to be a whole "word". From there, you don't have to worry about digits in the match anymore, because they are not part of the character sets being matched anyway.
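A rough way to test that reasoning is to compare the pattern with a version that needs no lookarounds at all: split on whitespace and require each token to match in full. This is only a sketch; the token `st*r9` is a hypothetical extra case added to the question's test text, and `token` and `by_split` are names made up here.

```python
import re

whole_word = re.compile(r'''
    (?<!\S)        # start of string or right after whitespace
    [a-zA-Z*]*
    \*
    [a-zA-Z*]*
    (?!\S)         # end of string or right before whitespace
''', re.VERBOSE)

# The question's tokens plus one hypothetical token with a trailing digit.
text = '*** star* st**r 800*m *4,500 800**m **4,000 st*r9'

by_regex = whole_word.findall(text)

# Same idea without lookarounds: every whitespace-delimited token must
# consist only of letters and asterisks and contain at least one asterisk.
token = re.compile(r'[a-zA-Z*]*\*[a-zA-Z*]*')
by_split = [word for word in text.split() if token.fullmatch(word)]

print(by_regex)
print(by_split)
```

Both lists come out as `['***', 'star*', 'st**r']` for this input, which is consistent with the idea that the whitespace boundaries, rather than any digit-specific check, are what exclude tokens containing digits.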

edited Mar 28 '19 at 22:47

answered Mar 28 '19 at 22:28


So, you want to say your question is a duplicate of [Regex: Specify “space or start of string” and “space or end of string”](https://stackoverflow.com/a/6713427/3832970)? You only need whitespace boundaries? You did not set these requirements in the question. – Wiktor Stribiżew Mar 28 '19 at 22:55
But your pattern is not even attempting to filter digits. If you need to match whole words within whitespace boundaries, then your answer is valid and the question is a duplicate, else the answer is not valid. – Wiktor Stribiżew Mar 28 '19 at 23:01
@WiktorStribiżew It doesn't need to, at least as far as my cursory tests go. I explained why in the answer. Instead of filtering out digits, it filters *in* letters and asterisks. – bongbang Mar 28 '19 at 23:05
@WiktorStribiżew Note that this is not the behavior of the original pattern I started with at the top of this page. With *that* pattern, I needed to filter out digits, hence this question. Your comment inspired me to come up with a better solution. Thank you. – bongbang Mar 28 '19 at 23:09
