Matching previously defined groups in python

Question

This question is about matching previously defined groups in Python, but it is not quite as simple as that.

Here is the text that I want to match:

Figure 1: Converting degraded weaponry to research materials.

Converting degraded weaponry to research
materials.

Here is my regular expression:

(Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)

The problem is that the regular expression fails to match the text because of the linefeed that appears after "research" on the third line. I want Python to ignore linefeeds when matching the previous group against my string.

That's not a thing in standard regular expressions, near as I know. Try Python's fuzzy matching. — FrankieTheKneeMan, Oct 23 '13 at 17:53
I believe you can accomplish this with `re.MULTILINE`. See if this helps: http://stackoverflow.com/questions/587345/python-regular-expression-matching-a-multiline-block-of-text — Hoopdady, Oct 23 '13 at 18:00
@Hoopdady No, re.MULTILINE only causes the `^` and `$` anchors to match at the beginning and end of every line, instead of only at the beginning and end of the string. http://docs.python.org/2/library/re.html#module-contents — FrankieTheKneeMan, Oct 23 '13 at 18:10
You have to canonize the text beforehand in some way, for that kind of match to work. One possibility is `textwrap`. — jhermann, Oct 23 '13 at 20:40

score 0 · Accepted Answer · answered Oct 25 '13 at 18:39

There seem to be two general approaches to this: either canonicalize the text (as suggested by jhermann), or have a function/code fragment that runs for each probable match and does a more complicated comparison than you could do in a single regex.

Canonicalize:

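import re

somespecialsequence = "@@PARA@@"  # any placeholder that never occurs in the text
text = re.sub(r"\n\n", somespecialsequence, text)  # protect paragraph breaks
text = re.sub(r"\s*\n", " ", text)                 # join wrapped lines
text = re.sub(r"\s+", " ", text)                   # collapse whitespace runs
text = re.sub(somespecialsequence, "\n\n", text)   # restore paragraph breaks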

Now, this should work as expected: (Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)
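As a minimal, self-contained sketch of the canonicalization approach, here it is applied to the text from the question. The function name `canonicalize` and the `@@PARA@@` placeholder are assumptions made for illustration; any placeholder works as long as it never occurs in the input.

```python
import re

# Hypothetical placeholder; assumed never to occur in the input text.
PLACEHOLDER = "@@PARA@@"

def canonicalize(text):
    """Join wrapped lines but keep blank-line paragraph breaks."""
    text = re.sub(r"\n\n", PLACEHOLDER, text)   # protect paragraph breaks
    text = re.sub(r"\s*\n", " ", text)          # a lone linefeed becomes a space
    text = re.sub(r"\s+", " ", text)            # collapse whitespace runs
    return text.replace(PLACEHOLDER, "\n\n")    # restore paragraph breaks

# The regular expression from the question, unchanged.
PATTERN = re.compile(
    r"(Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)"
)

sample = (
    "Figure 1: Converting degraded weaponry to research materials.\n"
    "\n"
    "Converting degraded weaponry to research\n"
    "materials."
)

match = PATTERN.search(canonicalize(sample))
print(match.group("description") if match else "no match")
```

On the raw `sample` the pattern finds nothing; after `canonicalize` the backreference matches because the wrapped line has been joined with a single space.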

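Or use a code fragment:

# Lazy quantifiers keep each match to one figure; \Z lets the last figure match too.
matches = re.finditer(r"(Figure )(\d+)(: )(.+?)(\n\n)(.+?)(?=Figure |\Z)", text, flags=re.S)
for m in matches:
    # Collapse every run of non-word characters (including linefeeds) to one space.
    text1 = re.sub(r"\W+", " ", m.group(4)).strip()
    text2 = re.sub(r"\W+", " ", m.group(6)).strip()
    if text1 == text2:
        print("match:", m.group(0))  # this is a match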
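The same idea can be wrapped in a function, roughly like this. This is a sketch under assumptions: the names `FIGURE`, `normalize`, and `repeated_captions` are made up for illustration, and the sample text with a second figure is invented to show a non-matching case. The groups are lazy so that each match stays within one figure, and `\Z` allows the last figure in the text to match.

```python
import re

# Lazy groups keep each match within one figure; \Z lets the last figure match.
FIGURE = re.compile(r"Figure (\d+): (.+?)\n\n(.+?)(?=Figure |\Z)", flags=re.S)

def normalize(s):
    """Collapse runs of non-word characters (linefeeds included) to one space."""
    return re.sub(r"\W+", " ", s).strip()

def repeated_captions(text):
    """Return (figure number, caption) pairs whose caption is repeated below it."""
    found = []
    for m in FIGURE.finditer(text):
        if normalize(m.group(2)) == normalize(m.group(3)):
            found.append((int(m.group(1)), m.group(2)))
    return found

sample = (
    "Figure 1: Converting degraded weaponry to research materials.\n"
    "\n"
    "Converting degraded weaponry to research\n"
    "materials.\n"
    "\n"
    "Figure 2: A second caption.\n"
    "\n"
    "Something else entirely.\n"
)

print(repeated_captions(sample))
```

Because `\W+` also swallows punctuation, this comparison is looser than an exact match: two captions that differ only in punctuation or line wrapping are treated as equal.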
