Capture ALL strings within a Python script with regex

Question

This question was inspired by my failed attempts to adapt this answer: RegEx: Grabbing values between quotation marks

Consider the following Python script (t.py):

print("This is also an NL test")
variable = "!\n"
print('And this has an escaped quote "don\'t"  in it ', variable,
      "This has a single quote ' but doesn\'t end the quote as it" + \
      " started with double quotes")
if "Foo Bar" != '''Another Value''':
    """
    This is just nonsense
    """
    aux = '?'
    print("Did I \"failed\"?", f"{aux}")

I want to capture all strings in it, as:

This is also an NL test
!\n
And this has an escaped quote "don\'t" in it
This has a single quote ' but doesn\'t end the quote as it
started with double quotes
Foo Bar
Another Value
This is just nonsense
?
Did I \"failed\"?
{aux}

I wrote another Python script using the re module and, of all my regex attempts, the one that finds the most strings is:

import re
pattern = re.compile(r"""(?<=(["']\b))(?:(?=(\\?))\2.)*?(?=\1)""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg, re.DOTALL)
for i, s in enumerate(x):
    print(f'[{i}]',s.group(0))

with the following result:

[0] And this has an escaped quote "don\'t" in it
[1] This has a single quote ' but doesn\'t end the quote as it started with double quotes
[2] Foo Bar
[3] Another Value
[4] Did I \"failed\"?

To make matters worse, I also couldn't fully replicate what I found with regex101.com.

I'm using Python 3.6.9, by the way, and I'm asking for more insights into regex to crack this one.

Only my opinion, but this really seems much too complex for a simple regex. If the underlying text follows a specific grammar (here the Python language), you have to parse it according to that grammar. So for a Python script you should use the `ast` module. For other grammars, [PLY](http://www.dabeaz.com/ply/) can be used to write a parser. – Serge Ballesta, Mar 04 '20 at 10:56
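
The `ast` suggestion in the comment above could look roughly like the sketch below. It is only an illustration, not code from the thread: the helper name `string_constants` is made up here, and it assumes Python 3.8 or newer, where string literals parse to `ast.Constant` nodes (on 3.6, as used in the question, they parse to `ast.Str` instead). Note that this yields the evaluated values (so `"!\n"` becomes `!` followed by a real newline), and an f-string made only of a placeholder, such as `f"{aux}"`, contributes no string constant at all.

```python
import ast


def string_constants(source):
    """Return the values of all plain string literals in `source`, in source order.

    Hypothetical helper for illustration; assumes Python 3.8+ (ast.Constant).
    """
    tree = ast.parse(source)
    nodes = [
        node
        for node in ast.walk(tree)
        if isinstance(node, ast.Constant) and isinstance(node.value, str)
    ]
    # ast.walk does not guarantee source order, so sort by position.
    nodes.sort(key=lambda node: (node.lineno, node.col_offset))
    return [node.value for node in nodes]


sample = 'x = "a\\n"\nprint(\'b\', """c""")\n'
print(string_constants(sample))
```

Unlike a regex, this does not get confused by quote characters inside comments, because the parser already knows what is and is not a string literal.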

CertainPerformance · Accepted Answer · 2020-03-04T10:52:57.100

Because you want to match ''' or """ or ' or " as the delimiter, put all of that into the first group:

('''|"""|["'])

Don't put \b after it, because then it won't match strings when those strings start with something other than a word character.
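
A small check of the `\b` point, as a sketch rather than part of the original answer (the variable names are chosen here for illustration). A word boundary only exists between a word character and a non-word character, so an opening quote followed by `!` or `?` never satisfies `["']\b`:

```python
import re

# Opening-quote patterns with and without the trailing word boundary.
with_boundary = re.compile(r"""["']\b""")
without_boundary = re.compile(r"""["']""")

word_start = '"Foo Bar"'   # body starts with a word character
symbol_start = r'"!\n"'    # body starts with "!", which is not a word character

print(bool(with_boundary.search(word_start)))       # True
print(bool(with_boundary.search(symbol_start)))     # False
print(bool(without_boundary.search(symbol_start)))  # True
```

This is consistent with the question's output, where `"!\n"` and `'?'` are among the missing strings.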

Because you want to make sure that the final delimiter isn't treated as a starting delimiter when the engine starts the next iteration, you'll need to fully match it (not just lookahead for it).
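
To illustrate why the closing delimiter has to be consumed, here is a minimal comparison on a made-up input (a simplified pattern without escape handling, used only for this demonstration). With a lookahead, the closing quote of one string is left in place and gets picked up as the opening quote of the next match:

```python
import re

text = '"a" + "b"'

# Closing quote only asserted, not consumed.
lookahead_close = re.compile(r"""(["'])(.*?)(?=\1)""")
# Closing quote consumed as part of the match.
consumed_close = re.compile(r"""(["'])(.*?)\1""")

print([m.group(2) for m in lookahead_close.finditer(text)])  # ['a', ' + ', 'b']
print([m.group(2) for m in consumed_close.finditer(text)])   # ['a', 'b']
```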

The middle part to match anything but the delimiter can be:

((?:\\.|.)*?)
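
The effect of trying `\\.` before `.` can be seen in a short sketch (the sample text is taken from the question; the pattern names are chosen here). The alternation consumes a backslash together with the character after it, so an escaped quote cannot end the match early:

```python
import re

text = r'"Did I \"failed\"?"'

# Plain lazy body: stops at the first quote, even an escaped one.
naive = re.compile(r"""(["'])(.*?)\1""")
# Body that consumes backslash-escaped characters as pairs.
escape_aware = re.compile(r"""(["'])((?:\\.|.)*?)\1""")

print([m.group(2) for m in naive.finditer(text)])         # ['Did I \\', '?']
print([m.group(2) for m in escape_aware.finditer(text)])  # ['Did I \\"failed\\"?']
```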

Put it all together:

('''|"""|["'])((?:\\.|.)*?)\1

and the result you want will be in the second capture group:

import re

# (?s) lets . match newlines; group 1 is the delimiter, group 2 the string body
pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""")
with open('t.py', 'r') as f:
    msg = f.read()
x = pattern.finditer(msg)
for i, s in enumerate(x):
    print(f'[{i}]', s.group(2))

https://regex101.com/r/dvw0Bc/1

edited Mar 04 '20 at 10:52

answered Mar 04 '20 at 10:44

Nice work, I'm curious whether this would actually work in Python though. Switching to `Python` on regex101 shows errors. I don't know enough about Python to tell you if it would or would not work. Just a heads-up =) – JvdV Mar 04 '20 at 10:59
The "put it all together" is the plain pattern, without escaping. As you can see in the Python code in the answer, since the `"""` is being used as the regex delimiter, the first `"` in the `"""` in the pattern is escaped, resulting in valid syntax. – CertainPerformance Mar 04 '20 at 11:01
Thanks, you really nailed it! The only modification I needed to make it work properly for me (maybe a difference in Python version) was to *compile* the pattern with dotall: `pattern = re.compile(r"""(?s)('''|\"""|["'])((?:\\.|.)*?)\1""", re.DOTALL)` – iperetta Mar 04 '20 at 12:14
@iperetta `(?s)` is already making `.` match line break chars, so you may safely remove `re.DOTALL`. – Wiktor Stribiżew Mar 04 '20 at 16:00
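
A quick check of the point about `(?s)`, plus a related pitfall, sketched with toy patterns that are not from the thread. On a compiled pattern, the second positional argument of `finditer` is `pos` (the start offset), not a flags value, so passing `re.DOTALL` there, as the question's script does, starts the search at index 16 and sets no flag:

```python
import re

# The inline (?s) flag and the re.DOTALL argument set the same flag.
inline = re.compile(r"(?s)a.b")
explicit = re.compile(r"a.b", re.DOTALL)
print(bool(inline.flags & re.DOTALL), bool(explicit.flags & re.DOTALL))  # True True
print(bool(inline.search("a\nb")))  # True

# Pattern.finditer(string, pos): re.DOTALL equals 16, so it is read as pos=16.
plain = re.compile(r"x")
text = "x" + "-" * 20 + "x"
print([m.start() for m in plain.finditer(text)])             # [0, 21]
print([m.start() for m in plain.finditer(text, re.DOTALL)])  # [21]
```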