
I have the following Python regex pattern from an earlier question:

import re

regex_pat = re.compile(r'''
            (
            [a-zA-Z\*]*
            \*
            [a-zA-Z\*]*
            )+           
          ''', re.VERBOSE) 

Now I want the match to fail if any digit is mixed in with the "word", especially at the start or the end.

text = '''
    (A) Match these:
    *** star* st**r

    (B) Not these:
    800*m *4,500 

    (C) And not these:
    800**m **4,000
    '''

By placing a negative lookbehind and a negative lookahead in various positions, I can get rid of the (B) matches, but not the (C) matches. For example:

regex_pat = re.compile(r'''
            (
            [a-zA-Z\*]*
            (?<!\d)
            \*
            (?!\d)
            [a-zA-Z\*]*
            )+           
          ''', re.VERBOSE) 
regex_pat.findall(text)
# ['***', 'star*', 'st**r', '**m', '**'] The last two matches are no good.

Apparently, when regex runs into a negative lookahead, it takes a step back to see if it can get a match. How can I make the negative lookarounds greedier or more destructive, so to speak?
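To see where the engine ends up after backtracking, it may help to print the span of each match in the two (C) strings. This is only a sketch: the pattern is the attempt above written as a raw string, the inputs are taken from the test text, and `show_spans` is a hypothetical helper name.

```python
import re

# Same pattern as the attempt above, written as a raw string.
regex_pat = re.compile(r'''
            (
            [a-zA-Z\*]*
            (?<!\d)
            \*
            (?!\d)
            [a-zA-Z\*]*
            )+
          ''', re.VERBOSE)

def show_spans(s):
    """Return (matched_text, start, end) for every match in s."""
    return [(m.group(0), m.start(), m.end()) for m in regex_pat.finditer(s)]

print(show_spans('800**m'))   # [('**m', 3, 6)]
print(show_spans('**4,000'))  # [('**', 0, 2)]
```

In `800**m` the match starts at the first asterisk: the leading character class consumes it, so the lookbehind in front of the literal `\*` sees an asterisk rather than the `0`. In `**4,000` the literal `\*` lands on the first asterisk, the lookahead sees the second asterisk rather than the `4`, and the trailing class then consumes that second asterisk. Nothing in the pattern checks what comes before or after the match as a whole.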

bongbang
  • Try `(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*`, see https://regex101.com/r/Gsq87y/1 – Wiktor Stribiżew Mar 28 '19 at 21:04
  • @WiktorStribiżew Lookarounds need to be fixed-width in Python, so I doubt `(?!\*+\d)` will work, but your answer inspired me to come up with something that seems to work, almost miraculously. Thank you. – bongbang Mar 28 '19 at 22:09
  • `(?!\*+\d)` works in Python `re`. It is not a lookbehind, it is a lookahead whose length does not have to be fixed-width. See my answer below. – Wiktor Stribiżew Mar 28 '19 at 22:09

2 Answers

1

You may use

(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*

See the regex demo.

Details

  • (?<!\S) - the current position must be at the start of the string or right after whitespace
  • (?!\*+\d) - fail the match if after 1 or more asterisks there is a digit
  • [a-zA-Z]* - 0+ letters
  • \* - asterisk
  • [a-zA-Z*]* - 0+ letters or asterisks.

The point is to start matching at the start of string or after whitespace, check if there is no digit after 1 or more asterisks and then match the pattern you need.
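As a small check of the lookahead step: Python's `re` accepts a variable-length lookahead such as `(?!\*+\d)`, while a variable-length lookbehind is rejected at compile time. The sketch below assumes only the standard-library `re` module; both patterns in it are made-up examples for illustration, not part of the solution.

```python
import re

# A lookahead may contain a quantifier such as \*+ ...
lookahead = re.compile(r'(?!\*+\d)\*+')
print(lookahead.findall('** **4'))  # ['**']

# ... but a lookbehind must have a fixed width, so this fails to compile.
try:
    re.compile(r'(?<!\d\*+)m')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False
```

Only the first run of asterisks is returned, because the second run is followed by a digit and the lookahead rejects every starting position inside it.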

See the Python demo:

import re
text = '''
    (A) Match these:
    *** star* st**r

    (B) Not these:
    800*m *4,500 

    (C) And not these:
    800**m **4,000
    '''
print(re.findall(r'(?<!\S)(?!\*+\d)[a-zA-Z]*\*[a-zA-Z*]*', text))
# => ['***', 'star*', 'st**r']
Wiktor Stribiżew
-1

This answer to my own question is inspired by Wiktor Stribiżew's comment. It seems to work. I'm posting it here so that a sharper eye may be able to tell me about any flaws in it.

regex_pat = re.compile(r'''
            (?<!\S)
            [a-zA-Z*]*            
            \*
            [a-zA-Z*]*
            (?!\S)
          ''', re.VERBOSE) 

The logic, as I understand it, is that the lookbehind and the lookahead force any match to be a whole "word". From there, you no longer have to worry about digits in the match, because they are not part of the character sets being matched anyway.
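One way to probe this for flaws is to run the pattern on a few extra inputs. The sketch below is an assumption-laden test rather than a proof: the last two sample strings are hypothetical and not part of the original test text, and the names `whole_word`, `samples` and `results` are illustrative.

```python
import re

# The whitespace-boundary pattern from this answer, as a raw string.
whole_word = re.compile(r'''
            (?<!\S)
            [a-zA-Z*]*
            \*
            [a-zA-Z*]*
            (?!\S)
          ''', re.VERBOSE)

# The first two samples come from the question; the last two are hypothetical.
samples = ['*** star* st**r', '800**m **4,000', 'a*4', 'star*, st**r.']
results = {s: whole_word.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), '->', found)
```

Only the first sample produces matches (`'***'`, `'star*'`, `'st**r'`). Note that the last sample yields nothing either: a word followed directly by a comma or a full stop is not bounded by whitespace, so it is rejected along with the digit cases. Whether that is acceptable depends on the text being searched.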

bongbang
  • So, you want to say your question is a duplicate of [Regex: Specify “space or start of string” and “space or end of string”](https://stackoverflow.com/a/6713427/3832970)? You only need whitespace boundaries? You did not set these requirements in the question. – Wiktor Stribiżew Mar 28 '19 at 22:55
  • But your pattern is not even attempting to filter digits. If you need to match whole words within whitespace boundaries, then your answer is valid and the question is a duplicate, else the answer is not valid. – Wiktor Stribiżew Mar 28 '19 at 23:01
  • @WiktorStribiżew It doesn't need to, at least as far as my cursory tests go. I explained why in the answer. Instead of filtering out digits, it filters *in* alphabets and asterisks. – bongbang Mar 28 '19 at 23:05
  • @WiktorStribiżew Note that this is not the behavior of the original pattern I started with at the top of this page. With *that* pattern, I needed to filter out digits, hence this question. Your comment inspired me to come up with a better solution. Thank you. – bongbang Mar 28 '19 at 23:09