7

I want a regular expression that finds the text "wrapped" between "HEAD" or "HEADa" and "HEAD". That is, the text may start with either HEAD or HEADa as its first word, and all the following "heads" are of type HEAD.

  1. HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
  2. HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....

I only want to capture the text between the "heads", so I wrote a regex with lookbehind and lookahead expressions that look for my "heads". I have the following regex:

var = "HEADa", "HEAD"

my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)

However, when I try to compile this regex, I get an error message saying that the lookbehind expression cannot have variable length. What is wrong with this regex?

Chris Morgan
andreSmol

1 Answer

14

Currently, the first part of your regex looks like this:

(?<=^\bHEADa|HEAD\b)

You have two alternatives; one matches five characters and the other matches four, and that's why you get the error. Some regex flavors will let you do that even though they say they don't allow variable-length lookbehinds, but not Python. You could break it up into two lookbehinds, like this:

(?:(?<=^HEADa\b)|(?<=\bHEAD\b))

...but you probably don't need lookbehinds for this anyway. Try this instead:

(?:^HEADa|\bHEAD)\b

Whatever gets matched by the (.*?) later on will still be available through group #1. If you also need the whole of the text between the delimiters, you can wrap that in its own capturing group, which becomes group #1 and pushes the other group to #2 (or you can use named groups, and not have to keep track of the numbers).
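As a rough sketch of the plain-match approach, the delimiter can be consumed by a non-capturing group while group #1 holds the text between the heads. The sample string, the `\s*` trimming, and the `\Z` alternative (so the last section is also returned) are assumptions of this example, not part of the question's pattern:

```python
import re

first_head, head = "HEADa", "HEAD"  # same values as `var` in the question

# Match the head directly instead of looking behind for it, capture the
# body lazily, and stop just before the next head or the end of the string.
# The \s* trimming and the \Z alternative are assumptions of this sketch.
pattern = re.compile(
    rf"(?:^{re.escape(first_head)}|\b{re.escape(head)})\b\s*(.*?)\s*"
    rf"(?=\b{re.escape(head)}\b|\Z)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical sample text, loosely modelled on the question's layout.
sample = "HEADa\n\nfirst section\nHEAD\n\nsecond section\nHEAD\n\nthird section\n"

sections = pattern.findall(sample)
print(sections)  # ['first section', 'second section', 'third section']
```

With a single capturing group, `findall` returns only the group's contents, so the heads themselves never appear in the result.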

Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you're usually better off doing a straight match and extracting the part you want with a capturing group. And that's true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors doesn't mean you should.
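For anyone who wants to check the fixed-width restriction described earlier in this answer, here is a small sketch. The helper name `compiles` is made up for this example; the two patterns are the ones quoted above:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# One lookbehind whose alternatives are 5 and 4 characters wide: rejected.
single_lookbehind = r"(?<=^\bHEADa|HEAD\b)"
# One fixed-width lookbehind per alternative: accepted.
split_lookbehinds = r"(?:(?<=^HEADa\b)|(?<=\bHEAD\b))"

print(compiles(single_lookbehind))  # False
print(compiles(split_lookbehinds))  # True
```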

BTW, you may have noticed that I redistributed your word boundaries; I think this is what you really intended.
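To illustrate what moving the word boundaries changes, this sketch compares the two delimiter patterns on a few made-up test words. In the original, `^\b` applies only to HEADa and the trailing `\b` only to HEAD, so partial words can slip through:

```python
import re

original = re.compile(r"^\bHEADa|HEAD\b")      # boundaries as in the question
adjusted = re.compile(r"(?:^HEADa|\bHEAD)\b")  # boundaries as in this answer

# Hypothetical test words: two that should not count as heads, two that should.
words = ("HEADache", "OVERHEAD", "HEAD", "HEADa")
results = {
    w: (bool(original.search(w)), bool(adjusted.search(w))) for w in words
}

for word, (orig_hit, adj_hit) in results.items():
    print(word, orig_hit, adj_hit)
# HEADache True False
# OVERHEAD True False
# HEAD True True
# HEADa True True
```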

Alan Moore
  • You got me to the point :) +1. In addition, you could interpolate your variables like this: regex = re.compile(r'(?<=^\b%s|%s\b) \w*\s\s(.*?)(?=\b%s\b)' % (var[0], var[1], var[1]), re.DOTALL|re.MULTILINE) – FailedDev Nov 19 '11 at 15:02