
I want to match all text following >, and optionally match links on the same line:

preg_match('#(href="([^"]*))?.*>(.*)#', '<a href="world.html">Hello', $m);
print_r($m);

Input examples:

<a href="#catch-me" style="nice">Capture this text
This text should be ignored <a href="#me-too">Other text to capture
<p>This line has no link, but should be matched anyway.

Expected result:

[2] => world.html
[3] => Hello

Actual result:

[2] => 
[3] => Hello

It works if I remove the question mark, but then the link obviously isn't optional anymore.

Why is this happening and how do I fix it?

forthrin
  • And what are other string formats you want to support? `Hello`? Try [`<.*?(href="([^"]*))?(?:(?!href=")[^>])*>(.*)`](https://regex101.com/r/wJ8cQ3/1) – Wiktor Stribiżew Sep 23 '16 at 12:04
  • Ugh! I'm going snow-blind from those complex look-around patterns. If this is the simplest way to do this, I'll split my code into two simple checks. An explanation why it has to be so complicated would be appreciated. – forthrin Sep 23 '16 at 12:08
  • The point is that the `.*` after an optional pattern will almost always "take" the optional subpattern value. Your regex would work for a string like `href="world.html">Hello`. But not if it is preceded with other symbols because the optional pattern matches an empty string, i.e. before *each* non-matching symbol. – Wiktor Stribiżew Sep 23 '16 at 12:11
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – MarcoS Sep 23 '16 at 14:58

1 Answer


Be very careful with optional subpatterns that are followed by `.*`.

The `.*` after an optional pattern will almost always consume the text that the optional subpattern was meant to capture. Your regex works for a string like `href="world.html">Hello`, where the subject starts with the optional part, but not when other characters precede it.

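Here is what happens when you run your regex against `<a href="world.html">Hello`. The engine starts at the beginning of the string, where `(href="([^"]*))?` cannot match `href="` because the next character is `<`. Since the group is optional, that is not a failure: it matches the empty string before `<`. Then the greedy `.*` takes over, matches everything up to the end of the line, and backtracks until the following `>` can match, which happens at the last `>`. Finally, `(.*)` captures the rest of the line into Group 3. An overall match has now been found with Group 2 empty, so the engine never retries the optional group at the position where `href="` actually is.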
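To make the walkthrough above concrete, here is a small sketch that runs the pattern from the question against two subjects: the original one, and the same string without the leading `<a `. The variable names `$withPrefix` and `$noPrefix` are only illustrative.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '#(href="([^"]*))?.*>(.*)#';

// Subject starts with "<a ", so href=" cannot match at offset 0
// and the optional group settles for an empty match there.
preg_match($pattern, '<a href="world.html">Hello', $withPrefix);

// Subject starts with href=", so the optional group matches at offset 0.
preg_match($pattern, 'href="world.html">Hello', $noPrefix);

var_dump($withPrefix[2]); // string(0) ""
var_dump($noPrefix[2]);   // string(10) "world.html"
```

In both cases Group 3 is `Hello`; only the position at which the optional group gets its first chance differs.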

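So you could match your values with the regex `(href="([^"]*))?(?:(?!href=")[^>])*>(.*)`, where `(?:(?!href=")[^>])*` is a tempered greedy token: it matches any characters other than `>` but stops before an `href="` sequence. A match attempt that starts before the link therefore fails, and the engine moves on until the optional group can match at `href="` itself. Alternatively, split the task into 2 operations (yes, that is preferable):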

1) Grab all the links
2) Check for the optional values.
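Both options can be sketched in PHP as follows. This is only one possible reading of the two-step idea (one check for the text after `>`, a separate check for the optional `href`); the function names `matchWithTemperedToken` and `matchInTwoSteps` are made up for illustration, and `PREG_UNMATCHED_AS_NULL` assumes PHP 7.2 or later.

```php
<?php
// Hypothetical helper: a single regex using the tempered greedy token.
function matchWithTemperedToken(string $line): ?array
{
    $pattern = '~(href="([^"]*))?(?:(?!href=")[^>])*>(.*)~';
    // PREG_UNMATCHED_AS_NULL reports the unmatched optional group as null.
    if (!preg_match($pattern, $line, $m, PREG_UNMATCHED_AS_NULL)) {
        return null;
    }
    return ['href' => $m[2], 'text' => $m[3]];
}

// Hypothetical helper: two simple, independent checks.
function matchInTwoSteps(string $line): ?array
{
    // Step 1: capture everything after the last ">" on the line.
    if (!preg_match('~.*>(.*)~', $line, $text)) {
        return null;
    }
    // Step 2: look for an href value anywhere on the line; it may be absent.
    $href = preg_match('~href="([^"]*)~', $line, $link) ? $link[1] : null;
    return ['href' => $href, 'text' => $text[1]];
}

// The three input examples from the question.
$lines = [
    '<a href="#catch-me" style="nice">Capture this text',
    'This text should be ignored <a href="#me-too">Other text to capture',
    '<p>This line has no link, but should be matched anyway.',
];

foreach ($lines as $line) {
    print_r(matchWithTemperedToken($line));
    print_r(matchInTwoSteps($line));
}
```

For these three inputs both helpers return the same `href` and `text`. They can differ on lines that contain more than one `>`, because the two-step version always takes the text after the last one.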

Wiktor Stribiżew