
How can I define a regex to find multiline comments in Python that contain the word "xyz"? Here is an example of a string that should match:

"""
blah blah
blah
xyz
blah blah
"""

I tried this regex:

"""((.|\n)(?!"""))*?xyz(.|\n)*?"""

(grep -i -Pz '"""((.|\n)(?!"""))*?xyz(.|\n)*?"""')

but it was not good enough. For example, for this input

 """
    blah blah blah
    blah
"""

   # xyz
               
 def foo(self):
"""
blah
"""

it matched this string:

"""

   # xyz
               
 def foo(self):
"""

The expected behavior in this case is to not match anything, since "xyz" is not inside a comment block.

I wanted it to only find "xyz" within opening quotes and closing quotes, but the string it matches is not inside a quotes block. It matches a string that starts with a quote, has "xyz" in it and ends with a quote, but the matched string is NOT inside a python comment block.

Any idea how to get the required behavior from this regex?

john1994
  • Parsing programming languages with regexes is mostly a bad idea. Have you considered the `ast` module? – gog Nov 16 '22 at 11:47
  • @john1994 – Are you really demanding _to get the required behavior from **this** regex?_ How about a quite different approach? And what do you want as output - the whole multiline string, or just the line with "xyz"? – Armali Nov 16 '22 at 14:26
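As a rough sketch of the `ast` route suggested in the comments above: the parser never sees `#` comments, so an "xyz" in a comment cannot be reported by mistake. The function name `multiline_strings_containing` and the sample source are made up for illustration, and this assumes the file is valid Python 3.8+ that `ast.parse` accepts.

```python
import ast

def multiline_strings_containing(source, needle="xyz"):
    """Return (first_line, last_line) for each multi-line string literal containing needle."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        is_string = isinstance(node, ast.Constant) and isinstance(node.value, str)
        # A literal spanning several lines is, in practice, almost always
        # a triple-quoted block (assumption: no multi-line concatenations).
        if is_string and node.end_lineno > node.lineno and needle.lower() in node.value.lower():
            hits.append((node.lineno, node.end_lineno))
    return hits

# Hypothetical sample: "xyz" appears once in a comment and once in a docstring.
sample = '''"""
blah blah blah
blah
"""

# xyz

def foo(self):
    """
    blah
    """

def bar(self):
    """
    blah xyz
    """
'''

print(multiline_strings_containing(sample))
```

Only the docstring of `bar` is reported; the `# xyz` comment between the blocks is ignored because it is not part of the syntax tree.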

1 Answer


The main challenge is keeping the """ ... """ balance, i.e. tracking whether you are inside or outside a comment.
Here is an idea that works with PCRE (e.g. grep -Pz, as in your example) or with PyPI regex in Python.

(?ims)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""(?(1)|(*SKIP)(*F))

See this demo at regex101 (used with i ignorecase, m multiline and s dotall flags)

This works because the search string is matched optionally, which prevents the engine from backtracking into another match and losing the overall balance. The simplest pattern for keeping the balance would be """.*?""". But as soon as you require some substring inside, the regex engine will try to make the match succeed from other starting positions, including a closing """.

To get around this, the search string can be made optional, so that every block is consumed from its opening to its closing quotes whether or not it contains the search string. Simplified example: """([^"]*?xyz)?.*?""" versus the unwanted """([^"]*?xyz).*?""".

To still let the matches without the search string fail, I used a conditional afterwards together with the PCRE verbs (*SKIP)(*F). If the first group did not participate (no search string inside), the match just gets skipped.
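The difference between the two simplified patterns can be checked with Python's built-in `re` module, which is enough for this part because no PCRE verbs are involved. This is only an illustrative sketch; the sample text is the question's input with the leading indentation removed.

```python
import re

text = '''"""
blah blah blah
blah
"""

# xyz

def foo(self):
"""
blah
"""
'''

# Search string required: the engine retries from the closing quotes.
required = re.compile(r'"""([^"]*?xyz).*?"""', re.S)
# Search string optional: every block is consumed from open to close.
optional = re.compile(r'"""([^"]*?xyz)?.*?"""', re.S)

wrong = required.search(text)
print(repr(wrong.group(0)))  # spans the "# xyz" comment between two blocks

groups = [m.group(1) for m in optional.finditer(text)]
print(groups)  # group 1 stays None for blocks without "xyz"
```

With the required group, the match starts at the closing quotes of the first block and ends at the opening quotes of the second. With the optional group, both blocks are matched as blocks and group 1 simply stays unset.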
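Python's built-in `re` module does not support (*SKIP)(*F), but the same effect can be approximated by dropping the conditional and filtering on group 1 in Python. The sketch below is an assumption-laden workaround rather than the pattern above verbatim: the helper name `comment_blocks_with_xyz` is hypothetical, and the sample assumes the """ delimiters start at the beginning of a line, as the `^` anchors require.

```python
import re

# Same pattern as above, minus the trailing (?(1)|(*SKIP)(*F)) part.
PATTERN = re.compile(r'(?ims)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""')

def comment_blocks_with_xyz(source):
    """Return the triple-quoted blocks in which group 1 (the search string) matched."""
    # finditer consumes every block in order, so the quotes stay balanced;
    # blocks where group 1 did not participate are filtered out here.
    return [m.group(0) for m in PATTERN.finditer(source) if m.group(1) is not None]

no_hit = '''
"""
blah blah blah
blah
"""

# xyz

def foo(self):
"""
blah
"""
'''

hit = '''
"""
blah blah
blah
XYZ
blah blah
"""
'''

print(comment_blocks_with_xyz(no_hit))
print(comment_blocks_with_xyz(no_hit + hit))
```

The first call finds nothing because "xyz" only occurs in the comment between the blocks; the second returns only the block that actually contains the search string.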


For usage with grep, here is a demo at tio.run; alternatively, use pcregrep -M '(?is)pattern'.
As mentioned above, in Python this pattern requires PyPI regex; see a Python demo at tio.run.

bobble bubble
  • Wow that looks promising, I gotta say! But for some reason it does not work when I go "grep -P '(?ms)^"""(?:(?:[^"]|"(?!""))*?(xyz))?.*?^"""(?(1)|(*SKIP)(*F))' test " in the terminal, when test content is exactly like the text you put in the regex testing link. Am I doing something wrong here? maybe I have to use some other grep flags? – john1994 Nov 16 '22 at 13:37
  • Yeah, I understand that. It's just weird that the same regex that catches exactly what I needed in the regex builder link doesn't do it from the terminal – john1994 Nov 16 '22 at 13:57
  • @john1994 I tried [this at tio.run](https://tio.run/##S0oszvj/PzU5I19BN1VBXUlJKSYvUaGisiomD8xOQmInw9nqCjUK6UWpBQq6AVX5Cuoa9pm5xZpxQAkNeysgio5Tiq0BshWVlDQ1tew1gNo0Ne31tOwhSjQMNWs0tIK9PQM0NbTcNDXV//8HAA) which seemed to work well. Another option is to use [`pcregrep`](https://man7.org/linux/man-pages/man1/pcregrep.1.html) (I edited my answer, see last line). – bobble bubble Nov 16 '22 at 14:47