This is a great question for Python regex because, sadly, in my opinion the re module is one of the most underpowered of the mainstream regex engines. That's why, for any serious regex work in Python, I turn to Matthew Barnett's stellar regex module, which incorporates some terrific features from Perl, PCRE and .NET.
The solution I'll show you can be adapted to work with re, but it is much more readable with regex because it is made modular. Also, consider it a starting point for more complex nested matching, because regex lets you write recursive regular expressions similar to those found in Perl and PCRE.
Okay, enough talk, here's the code (a mere four lines apart from the import and definitions). Please don't let the long regex scare you: it is long because it is designed to be readable. Explanations follow.
The Code
import regex

quote = regex.compile(r'''(?x)
    (?(DEFINE)
        (?<qmark>["'])          # what we'll consider a quotation mark
        (?<not_qmark>[^'"]+)    # chunk without quotes
        (?<a_quote>(?P<qopen>(?&qmark))(?&not_qmark)(?P=qopen))  # a non-nested quote
    )                           # End DEFINE block

    # Start Match block
    (?&a_quote)
    |
    (?P<open>(?&qmark))
    (?&not_qmark)?
    (?P<quote>(?&a_quote))
    (?&not_qmark)?
    (?P=open)
''')

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

for match in quote.finditer(text):
    print(match.group())
    if match.group('quote'):
        print(match.group('quote'))
The Output
'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!'
"How Doth the Little Busy Bee,"
'I will try again.'
How it Works
First, to simplify, note that I have taken the liberty of converting I'll to I will, reducing confusion with quotes. Addressing I'll would be no problem with a negative lookahead, but I wanted to make the regex readable.
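To give an idea of what handling I'll could look like, here is a rough sketch of one possible approach. It uses the standard re module and a lookbehind as well as a lookahead, and it assumes that an apostrophe always sits between two word characters, so a trailing possessive such as dogs' would still confuse it. The names QMARK, APOSTROPHE, CHUNK and simple_quote are made up for this illustration, and only non-nested quotes are covered.

```python
import re

QMARK = "[\"']"
# Assumption: an apostrophe is a ' with a word character on both sides.
APOSTROPHE = r"(?<=\w)'(?=\w)"
# A chunk is a run of non-quote characters and/or apostrophes.
CHUNK = "(?:[^'\"]|" + APOSTROPHE + ")+"

# (?<!\w) keeps an apostrophe from being mistaken for an opening quote.
simple_quote = re.compile(r"(?<!\w)(?P<q>" + QMARK + ")" + CHUNK + "(?P=q)")

text = "I'll say it once more: 'I'll try again.'"
found = [m.group() for m in simple_quote.finditer(text)]
print(found)
```

The same two ideas (a guarded opening quote and a chunk that tolerates apostrophes) could be folded into the qmark and not_qmark definitions, at some cost in readability.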
In the (?(DEFINE)...) block, we define the three sub-expressions qmark, not_qmark and a_quote, much in the way that you define variables or subroutines to avoid repeating yourself.
After the definition block, we proceed to matching:
(?&a_quote) matches an entire quote,
| or...
(?P<open>(?&qmark)) matches a quotation mark and captures it to the open group,
(?&not_qmark)? matches optional text that is not quotes,
(?P<quote>(?&a_quote)) matches a full quote and captures it to the quote group,
(?&not_qmark)? matches optional text that is not quotes,
(?P=open) matches the same quotation mark that was captured at the opening of the quote.
The Python code then only needs to print the match and the quote capture group if present.
Can this be refined? You bet. Working with (?(DEFINE)...) in this way, you can build beautiful patterns that you can later re-read and understand.
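For comparison, here is a minimal sketch of how the same pattern might be adapted to the standard re module, as mentioned at the top. Since re has neither (?(DEFINE)...) nor subroutine calls, the sub-expressions are pasted in as strings instead. The names QMARK, NOT_QMARK and the helper a_quote() are my own stand-ins for the three definitions, and q1 and q2 are arbitrary group names needed because each pasted copy must have its own backreference.

```python
import re

# Building blocks, standing in for the (?(DEFINE)...) entries.
QMARK = "[\"']"          # what we'll consider a quotation mark
NOT_QMARK = "[^'\"]+"    # chunk without quotes


def a_quote(group_name):
    """Return the pattern text for a non-nested quote.

    Each pasted copy gets its own group name so that its
    backreference points at its own opening quotation mark.
    """
    return f"(?P<{group_name}>{QMARK}){NOT_QMARK}(?P={group_name})"


quote = re.compile(
    f"""
    {a_quote('q1')}
    |
    (?P<open>{QMARK})
    (?:{NOT_QMARK})?
    (?P<quote>{a_quote('q2')})
    (?:{NOT_QMARK})?
    (?P=open)
    """,
    re.VERBOSE,
)

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

results = [(m.group(), m.group('quote')) for m in quote.finditer(text)]
for whole, inner in results:
    print(whole)
    if inner:
        print(inner)
```

It works, but the pattern text is now assembled outside the regex, which is the loss of readability I was referring to.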
Adding Recursion
If you want to handle more complex nesting using pure regex, you'll need to turn to recursion.
To add recursion, all you need to do is define a group and refer to it using the subroutine syntax. For instance, to execute the code within Group 1, use (?1). To execute the code within group something, use (?&something). Remember to leave an exit for the engine, either by making the recursion optional (with ?) or by placing it on one side of an alternation.
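As a small illustration of the syntax, here is a sketch that matches balanced parentheses with the regex module. Parentheses are used only because they are the simplest case to show; the group name parens and the sample text are made up for the example. The exit for the engine is the alternation inside a zero-or-more group: at each step the engine can take plain characters, recurse, or stop.

```python
import regex  # third-party module: pip install regex

balanced = regex.compile(r'''(?x)
    (?P<parens>                     # the group we will recurse into
        \(
        (?: [^()]+ | (?&parens) )*  # plain text, or a nested pair (recursion)
        \)
    )
''')

text = "f(a(b)c) and (d)"
found = [m.group() for m in balanced.finditer(text)]
print(found)
```

Each match is an outermost balanced pair; the nested (b) is consumed by the recursive call rather than reported on its own.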
References