I have the following content, which I think covers the possible ways someone might define a link:

`hello <a href='something.jpg'>link</a> world <a href="something.com">link</a> what <a href=something.jpg>link</a>`

I also have the following regular expression with a positive lookbehind:

`(?<=href=["\'])something`

The expression matches the word "something" in the first two links. To capture the third instance, where the link has no quotes, I thought making the `["\']` token optional (using `?`) would work. The expression now looks like this:

`(?<=href=["\']?)something`

Unfortunately, it now does not match any of the instances of "something". What am I doing wrong? I'm using http://gskinner.com/RegExr/ to test this out.

Matt W
  • [Why are you trying to parse HTML with regex?](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Eric Oct 11 '11 at 20:59

1 Answer

Many regex flavors only support fixed-length lookbehind assertions. If you have an optional token in your lookbehind, its length isn't fixed, rendering it invalid.

So the real question is: What regex flavor are you actually targeting with your regex?
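As one concrete illustration (not necessarily the flavor the question is targeting), Python's built-in `re` module is among the fixed-length flavors. The sketch below assumes the sample text from the question and shows the original pattern compiling while the version with the optional quote is rejected at compile time. The variable names are made up for the example.

```python
import re

# Sample text from the question.
html = ("hello <a href='something.jpg'>link</a> world "
        '<a href="something.com">link</a> what <a href=something.jpg>link</a>')

# Fixed-length lookbehind: "href=" plus exactly one quote character.
fixed = re.compile(r"""(?<=href=["'])something""")
fixed_hits = fixed.findall(html)  # only the two quoted links can match

# Variable-length lookbehind: the optional quote makes the width 5 or 6,
# which Python's re module refuses to compile.
try:
    re.compile(r"""(?<=href=["']?)something""")
    variable_error = None
except re.error as exc:
    variable_error = str(exc)

print(fixed_hits)
print(variable_error)
```

Other engines may behave differently: some reject the pattern with an error like this, and a tester may simply show no matches.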

Tim Pietzcker
  • I can't think of any engine offhand that allows non-fixed-width look-behind assertions. That said, you can use alternation to deal with simple cases like this one, e.g. use `(?<=href=["']|href=)something`. – Lily Ballard Oct 11 '11 at 21:01
  • @KevinBallard: I can think of .NET, Java, PCRE and JGSoft. Only .NET and JGSoft support infinite repetition inside lookbehind, though (as far as I know). The new (and still unofficial) Python regex package does so, too, I think. – Tim Pietzcker Oct 11 '11 at 21:02
  • PCRE requires look behind assertions to be fixed-length, though as I suggested in my comment it does allow alternations with differing lengths. I'm not familiar with the regex engines used by .NET, Java, or JGSoft. – Lily Ballard Oct 11 '11 at 21:06
  • @KevinBallard: According to [regular-expressions.info](http://www.regular-expressions.info/refflavors.html), you're right about PCRE's support for positive lookbehinds, but negative lookbehinds can be finite-length. – Tim Pietzcker Oct 11 '11 at 21:10
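The alternation idea from the comments can be sketched in Python, with one caveat: Python's `re` is stricter than PCRE and also rejects alternatives of different lengths inside a single lookbehind. A possible adaptation, assuming the sample text from the question, is to give each alternative its own fixed-length lookbehind, or to drop the lookbehind and use a capturing group. Variable names are hypothetical.

```python
import re

# Sample text from the question.
html = ("hello <a href='something.jpg'>link</a> world "
        '<a href="something.com">link</a> what <a href=something.jpg>link</a>')

# The single-lookbehind alternation (alternatives of width 6 and 5) is
# accepted by PCRE according to the comments, but not by Python's re.
try:
    re.compile(r"""(?<=href=["']|href=)something""")
    alternation_error = None
except re.error as exc:
    alternation_error = str(exc)

# Adaptation 1: two separate lookbehinds, each of fixed length.
split_lookbehind = re.compile(r"""(?:(?<=href=["'])|(?<=href=))something""")
split_hits = split_lookbehind.findall(html)

# Adaptation 2: no lookbehind at all; capture the part of interest instead.
capture = re.compile(r"""href=["']?(something)""")
capture_hits = capture.findall(html)

print(split_hits)
print(capture_hits)
```

Both adaptations should find "something" in all three links, including the unquoted one. The capturing-group form is the more portable of the two, since it does not depend on lookbehind support at all.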