Using regex, extract quoted strings that may contain nested quotes

Question

I have the following string:

'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'

Now, I wish to extract the following quotes:

1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.

I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?

import re

s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"

for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print('  '+g)

Well, try `r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=(?:(?!\1).)*)\1)'` — Wiktor Stribiżew, Sep 22 '16 at 11:56
I got error: `sre_constants.error: unbalanced parenthesis at position 49`. I tried to remove the extra closing parenthesis but matches are not as expected. — coder.in.me, Sep 22 '16 at 12:04
Yes, one `)` is redundant: `r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=(?:(?!\1).)*\1)'`. See https://regex101.com/r/gM1fO7/1. I see that it only prints the quotes. The point is that `[^\1]` does not match anything other than the Group 1 value. — Wiktor Stribiżew, Sep 22 '16 at 12:05
My output is blank. Just quotes are captured but not the text within. Thx — coder.in.me, Sep 22 '16 at 12:08
Two good answers: from m.cekiera and Steve Chambers. Not sure who should be given the bounty! — coder.in.me, Sep 27 '16 at 07:50

Steve Chambers · Accepted Answer · 2020-05-19T20:45:32.483

6

If you really need to return all the results from a single regular expression applied only once, you will need a lookahead ((?=findme)). A lookahead is zero-width, so the search position returns to where the match started instead of moving past the matched text, which lets later matches overlap earlier ones - see this answer for a more detailed explanation.
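As a minimal sketch of that idea (using a toy digit pattern rather than the quote regex, with made-up variable names), wrapping a capturing group in a lookahead lets each match report text without consuming it, so overlapping results are found in one pass:

```python
import re

text = "1234"

# A plain pattern consumes what it matches, so overlapping pairs are skipped.
consuming = re.findall(r"\d\d", text)

# Inside a lookahead nothing is consumed: the scan moves on one position
# after each empty match, and the capturing group still records the pair.
overlapping = re.findall(r"(?=(\d\d))", text)

print(consuming)    # ['12', '34']
print(overlapping)  # ['12', '23', '34']
```

The same mechanism is what allows an inner quote to be reported even though it sits inside an outer quote that has already been matched.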

To prevent false matches, some extra clauses about the quotes are also needed, which adds complexity: e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this, but the rules I've gone for are:

An opening quote must not be immediately preceded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.

Applying the above rules leads to the following regular expression:

(?=(?<!\w)(?:'(\w.*?)'|"(\w.*?)")(?!\w))
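A minimal sketch of applying this in Python, assuming the hybrid form of the expression (the one given in the comments, where the word-character checks cover both quote types). The names QUOTE_RE and quotes are only illustrative:

```python
import re

# Hybrid form of the expression: the "not next to a word character" checks
# apply to both single and double quotes.
QUOTE_RE = re.compile(r'''(?=(?<!\w)(?:'(\w.*?)'|"(\w.*?)")(?!\w))''')

s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# Only one of the two groups takes part in each match, so pick whichever is set.
quotes = [m.group(1) or m.group(2) for m in QUOTE_RE.finditer(s)]

for number, quote in enumerate(quotes, start=1):
    print("{}. {}".format(number, quote))
```

On the string from the question this should list the three quotes in the order the question asks for.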


Debuggex Demo

A quick sanity check for any candidate regular expression is to swap the single and double quotes in the test string: the same quotes should still be found. This has been done in this regex101 demo.
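The swap test can also be scripted. The following is a sketch only: the sample sentence, the SWAP table and the find_quotes helper are made up for illustration, and the pattern is the hybrid form from the comments.

```python
import re

QUOTE_RE = re.compile(r'''(?=(?<!\w)(?:'(\w.*?)'|"(\w.*?)")(?!\w))''')

# Translation table that turns every ' into " and every " into '.
SWAP = str.maketrans("'\"", "\"'")


def find_quotes(text):
    """Return the text of every quote found, outer and inner."""
    return [m.group(1) or m.group(2) for m in QUOTE_RE.finditer(text)]


s = "'I've said \"Hello there,\" twice.' She added, 'I'll stop.'"

normal = find_quotes(s)
swapped = find_quotes(s.translate(SWAP))

# A pattern that treats both quote types alike finds the same quotes
# either way, once the swapped results are translated back.
print(normal == [q.translate(SWAP) for q in swapped])
```

If the comparison prints False, the pattern handles one quote character differently from the other.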

edited May 19 '20 at 20:45

answered Sep 26 '16 at 15:18

Steve Chambers



This is an equally good answer as m.cekiera's. I have shortened your regex: `(?=(?:(?<!\w)(['"])(\w.*?)\1(?!\w)))` – coder.in.me Sep 27 '16 at 04:51
Agreed - the only difference with that alternative is it includes the quotation characters as capturing groups. – Steve Chambers Sep 27 '16 at 08:39
A possible hybrid between the two that avoids this is `(?=(?<!\w)(?:'(\w.*?)'|"(\w.*?)")(?!\w))`. – Steve Chambers Sep 27 '16 at 08:45

@arvindpdmn actually, this one is better than mine: it's simpler, faster, and it also matches better. Compare my [answer](https://regex101.com/r/rS4iP1/3) with @SteveChambers' [answer](https://regex101.com/r/rS4iP1/4) on an example like `"sentence!". "next"` to see what I mean – m.cekiera Sep 27 '16 at 13:53
For posterity have now updated the answer to use the hybrid regular expression mentioned above. – Steve Chambers May 19 '20 at 20:46

m.cekiera · Answer 2 · 2016-09-27T14:29:50.653

EDIT

I modified my regex; it now properly matches even more complicated cases:

(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))

DEMO

It is now even more complicated. The main improvements are that it no longer matches directly after certain punctuation characters ([!?.]) and that it separates the quote cases better. Verified on a variety of examples.

The sentence will be in the content capture group. Of course it has some restrictions related to the use of whitespace, etc., but it should work with most properly formatted sentences - or at least it works with the examples.

(?=(?<!\w|[!?.])('|\")(?!\s) - matches a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - matches the sentence, followed by the same character (' or ", as captured in group 1) that opened it, ignoring other quotes.

It doesn't match the whole sentence directly; the capturing group is nested inside a lookaround construct, so with the global match modifier it also matches sentences inside sentences, because the only thing it matches directly is the position before a sentence starts.

About your regex:

I suppose that by [^\1]* you meant any character except the one captured in group 1, but a character class doesn't work this way: inside a class, \1 is treated as a character in octal notation (the control character U+0001), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matching of THIS and THIS regex.
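A small sketch makes this visible in Python's re module (the sample strings are made up for illustration). Because the class only excludes U+0001, it runs straight over the quote that was supposed to stop it:

```python
import re

# Inside a character class, \1 is an octal escape for chr(1), so this class
# means "any character except U+0001", whatever group 1 captured.
pattern = re.compile(r"(['\"])([^\1]*)\1")

m = pattern.search("'one' and 'two'")
print(m.group(2))  # one' and 'two

# The only character the class refuses is the control character itself.
parts = re.findall(r"[^\1]+", "ab\x01cd")
print(parts)  # ['ab', 'cd']
```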

To achieve what you want, you should use a lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening character, then match every character that is not followed by the captured opening character, then capture one more character, the one directly before the closing character - and you have the whole content between the characters you excluded.
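A sketch of that lookaround in use. Two things here are assumptions added for the example rather than part of the pattern above: group 1 accepts either quote character, and a trailing \1 consumes the closing quote so it is not picked up again as an opening one.

```python
import re

# (?:.(?!\1))* takes characters as long as the next one is not the opening
# quote; the final . picks up the last character before the closing quote,
# and the trailing \1 consumes that closing quote.
pattern = re.compile(r"(['\"])((?:.(?!\1))*.)\1")

sample = "say 'hello world' and \"it's fine\" now"
contents = [m.group(2) for m in pattern.finditer(sample)]

print(contents)  # ['hello world', "it's fine"]
```

Note that the apostrophe in it's is left alone in the second match, because there the excluded character is the double quote captured in group 1.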

Seems to work. Awesome! Can you explain why my regex is not working, especially the `[^\1]*` part? Looks like a perfectly valid expression. Thx — coder.in.me, Sep 26 '16 at 13:35
@arvindpdmn I updated answer with explanation about `[^\1]*` — m.cekiera, Sep 26 '16 at 14:04
Thx. Your answer gave me a good understanding of lookahead and lookbehind. — coder.in.me, Sep 27 '16 at 07:49
@arvindpdmn I updated answer, compare our answers on examples from my demo, to see difference — m.cekiera, Sep 27 '16 at 14:30

score 2 · Answer 3 · answered Sep 27 '16 at 15:59

This is a great question for Python regex because sadly, in my opinion the re module is one of the most underpowered of mainstream regex engines. That's why for any serious regex work in Python, I turn to Matthew Barnett's stellar regex module, which incorporates some terrific features from Perl, PCRE and .NET.

The solution I'll show you can be adapted to work with re, but it is much more readable with regex because it is made modular. Also, consider it as a starting block for more complex nested matching, because regex lets you write recursive regular expressions similar to those found in Perl and PCRE.

Okay, enough talk, here's the code (a mere four lines apart from the import and definitions). Please don't let the long regex scare you: it is long because it is designed to be readable. Explanations follow.

The Code

import regex

quote = regex.compile(r'''(?x)
(?(DEFINE)
(?<qmark>["']) # what we'll consider a quotation mark
(?<not_qmark>[^'"]+) # chunk without quotes
(?<a_quote>(?P<qopen>(?&qmark))(?&not_qmark)(?P=qopen)) # a non-nested quote
) # End DEFINE block

# Start Match block
(?&a_quote)
|
(?P<open>(?&qmark))
  (?&not_qmark)?
  (?P<quote>(?&a_quote))
  (?&not_qmark)?
(?P=open)
''')

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

# The variable is called text rather than str so it does not shadow the built-in.
for match in quote.finditer(text):
    print(match.group())
    if match.group('quote'):
        print(match.group('quote'))

The Output

'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!'
"How Doth the Little Busy Bee,"
'I will try again.'

How it Works

First, to simplify, note that I have taken the liberty of converting I've to I have and I'll to I will, reducing confusion with quotes. Addressing the apostrophes would be no problem with a negative lookahead, but I wanted to make the regex readable.

In the (?(DEFINE)...) block, we define the three sub-expressions qmark, not_qmark and a_quote, much in the way that you define variables or subroutines to avoid repeating yourself.

After the definition block, we proceed to matching:

(?&a_quote) matches an entire quote,
| or...
(?P<open>(?&qmark)) matches a quotation mark and captures it to the open group,
(?&not_qmark)? matches optional text that is not quotes,
(?P<quote>(?&a_quote)) matches a full quote and captures it to the quote group,
(?&not_qmark)? matches optional text that is not quotes,
(?P=open) matches the same quotation mark that was captured at the opening of the quote.

The Python code then only needs to print the match and the quote capture group if present.

Can this be refined? You bet. Working with (?(DEFINE)...) in this way, you can build beautiful patterns that you can later re-read and understand.
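For readers limited to the standard library, here is one possible adaptation to re. It is a sketch, not the code above: the DEFINE sub-patterns are written out by hand because re has no subroutine calls, the two alternatives are folded into one pattern with an optional inner quote, and the names QUOTE_RE and results are made up for the example.

```python
import re

# qmark is written out as ["'] and not_qmark as [^'"] wherever they are used.
QUOTE_RE = re.compile(r'''(?x)
    (?P<open>["'])          # opening quotation mark
    [^'"]*                  # text before an inner quote, if any
    (?:
        (?P<quote>          # one optional inner, non-nested quote
            (?P<qopen>["'])
            [^'"]+
            (?P=qopen)
        )
        [^'"]*              # text after the inner quote
    )?
    (?P=open)               # the same mark that opened the outer quote
''')

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

results = []
for match in QUOTE_RE.finditer(text):
    results.append(match.group())
    if match.group('quote'):
        results.append(match.group('quote'))

for item in results:
    print(item)
```

Like the version above, this handles only one level of nesting and relies on the apostrophes having been removed from the text.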

Adding Recursion

If you want to handle more complex nesting using pure regex, you'll need to turn to recursion.

To add recursion, all you need to do is define a group and refer to it using the subroutine syntax. For instance, to execute the code within Group 1, use (?1). To execute the code within group something, use (?&something). Remember to leave an exit for the engine by either making the recursion optional (?) or one side of an alternation.


Thanks for the detailed answer and explanation. – coder.in.me Sep 28 '16 at 09:00

score 0 · Answer 4 · edited Sep 26 '16 at 18:00

It seems difficult to achieve with just one regex pass, but it can be done with a relatively simple regex and a recursive function:

import re

REGEX = re.compile(r"(['\"])(.*?[!.,])\1", re.S)

S = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.' 'And we may now add "some more 'random test text'.":' "Yes it seems to be a good idea!" 'ok, let's go.'"""


def extract_quotes(string, quotes_list=None):
    # Start from any quotes passed in, then add the ones found in this string.
    quotes = list(quotes_list or [])
    quotes += [found[1] for found in REGEX.findall(string)]
    index = 0
    for quote in quotes[:]:
        index += 1
        # Look for inner quotes and splice them in right after their parent.
        sub_list = extract_quotes(quote)
        quotes = quotes[:index] + sub_list + quotes[index:]
        index += len(sub_list)
    return quotes


print(extract_quotes(S))

This prints:

['Well, I\'ve tried to say "How Doth the Little Busy Bee," but it all came different!', 'How Doth the Little Busy Bee,', "I'll try again.", 'And we may now add "some more \'random test text\'.":\' "Yes it seems to be a good idea!" \'ok, let\'s go.', "some more 'random test text'.", 'Yes it seems to be a good idea!']

Note that the regex uses punctuation to determine whether a quoted text is a "real quote": in order to be extracted, a quote needs to end with a punctuation character before the closing quote. That is, 'random test text' is not considered an actual quote, while 'ok, let's go.' is.

The regex is pretty simple; I think it does not need explanation. The extract_quotes function finds all quotes in the given string and stores them in a list. Then it calls itself on each quote found, looking for inner quotes...

Recursive is a good idea. I was hoping a single regex will do it. Also, your approach might need to be expanded to other punctuation, which is difficult because the ending characters in quote could be anything. Your sample text is complex, which is good for testing. The output does not appear to be as expected for the extra text that you added. — coder.in.me, Sep 22 '16 at 14:39
