
This question is about matching previously defined groups in Python, but it is not quite as simple as that.

Here is the text that I want to match:

Figure 1: Converting degraded weaponry to research materials.

Converting degraded weaponry to research
materials.

Here is my regular expression:

(Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)

The problem is that the regular expression fails to match the text because of the linefeed that appears after "research" on the third line. I want Python to ignore linefeeds when matching the previous group against my string.

Cristian Lupascu
xaav
  • That's not a thing in standard regular expressions, near as I know. Try Python's fuzzy matching. – FrankieTheKneeMan Oct 23 '13 at 17:53
  • I believe you can accomplish this with `re.MULTILINE`. See if this helps: http://stackoverflow.com/questions/587345/python-regular-expression-matching-a-multiline-block-of-text – Hoopdady Oct 23 '13 at 18:00
  • Unfortunately, simply enabling re.MULTILINE was no help. – xaav Oct 23 '13 at 18:10
  • @Hoopdady No, re.MULTILINE only causes the `^` and `$` anchors to match at the beginning and end of every line, instead of only at the beginning and end of the string. http://docs.python.org/2/library/re.html#module-contents – FrankieTheKneeMan Oct 23 '13 at 18:10
  • You have to canonize the text beforehand in some way, for that kind of match to work. One possibility is `textwrap`. – jhermann Oct 23 '13 at 20:40

1 Answer

There seem to be two general approaches to this: either canonicalize the text (as suggested by jhermann), or have a function/code fragment that runs for each probable match and does a more complicated comparison than you could do in a single regex.

Canonicalize:

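import re
somespecialsequence = "\x00"  # placeholder, assumed never to occur in the text
text = re.sub(r"\n\n", somespecialsequence, text)  # protect paragraph breaks
text = re.sub(r"\s*\n", " ", text)                 # join wrapped lines
text = re.sub(r"\s+", " ", text)                   # collapse whitespace runs
text = text.replace(somespecialsequence, "\n\n")   # restore paragraph breaks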

Now, this should work as expected: (Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)
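As a rough check of the canonicalize approach, here is a minimal sketch run against the sample text from the question. The names `SENTINEL`, `canonicalize`, and `PATTERN` are made up for illustration, and the sentinel is assumed never to appear in the real input.

```python
import re

# Hypothetical placeholder; assumed never to occur in the real text.
SENTINEL = "\x00"

# The regular expression from the question, unchanged.
PATTERN = re.compile(
    r"(Figure )(\d)(\d)?(: )(?P<description>.+)(\n\n)(?P=description)"
)

def canonicalize(text):
    """Join wrapped lines while keeping blank-line paragraph breaks."""
    text = re.sub(r"\n\n", SENTINEL, text)   # protect paragraph breaks
    text = re.sub(r"\s*\n", " ", text)       # unwrap single linefeeds
    text = re.sub(r"\s+", " ", text)         # collapse whitespace runs
    return text.replace(SENTINEL, "\n\n")    # restore paragraph breaks

sample = (
    "Figure 1: Converting degraded weaponry to research materials.\n"
    "\n"
    "Converting degraded weaponry to research\n"
    "materials.\n"
)

match = PATTERN.search(canonicalize(sample))
print(match.group("description") if match else "no match")
```

On the raw `sample` the pattern finds nothing; after `canonicalize` the backreference lines up because the wrapped description has been joined onto one line.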

Or, compare the two groups in code:

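import re

# Lazy quantifiers keep each match inside one figure; \Z lets the last figure match too.
pattern = r"(Figure )(\d+)(: )(.+?)(\n\n)(.+?)(?=Figure |\Z)"
for m in re.finditer(pattern, text, flags=re.S):
    # Collapse punctuation and whitespace (including linefeeds) before comparing.
    text1 = re.sub(r"\W+", " ", m.group(4)).strip()
    text2 = re.sub(r"\W+", " ", m.group(6)).strip()
    if text1 == text2:
        print(m.group(0))  # this is a match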
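
The same idea can be wrapped in a small helper and tried on text with more than one figure. This is only a sketch: `normalize`, `repeated_captions`, and the second figure in `sample` are invented for illustration, and the pattern here uses lazy quantifiers and also accepts the end of the text, so that each match stays within a single figure.

```python
import re

# Lazy quantifiers stop at the first blank line and the next "Figure ";
# \Z allows the final figure, which has no "Figure " after it.
FIGURE = re.compile(r"Figure (\d+): (.+?)\n\n(.+?)(?=Figure |\Z)", flags=re.S)

def normalize(s):
    """Collapse every run of non-word characters into a single space."""
    return re.sub(r"\W+", " ", s).strip()

def repeated_captions(text):
    """Return (figure number, caption) pairs whose caption is repeated after a blank line."""
    found = []
    for m in FIGURE.finditer(text):
        caption, body = normalize(m.group(2)), normalize(m.group(3))
        if caption == body:
            found.append((int(m.group(1)), caption))
    return found

sample = (
    "Figure 1: Converting degraded weaponry to research materials.\n\n"
    "Converting degraded weaponry to research\nmaterials.\n\n"
    "Figure 2: A second caption.\n\n"
    "Something else entirely.\n"
)

print(repeated_captions(sample))
```

Because `\W+` also removes punctuation, two descriptions that differ only in punctuation or line wrapping compare as equal; that may or may not be what you want.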
Alex I