
I'm using the PHP pattern modifier "U" to invert the default greedy behavior in preg_match(). However, it doesn't work the way I expect. My code:

$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = "a.mov";

$pattern = "/\<a.*".preg_quote($needle, "/").".*\<\/a\>/sU";

preg_match($pattern, $str, $matches);
print_r($matches);

I'm trying to match:

<a href="a.mov"></a>

But this code returns:

<a aaa
    <a href="a.mov"></a>

Can someone shed some light on what I did wrong?

Aurelio De Rosa
potato
  • Your $matches variable doesn't equal anything, does it? How do you print it when it's not initialized? – Grigor Oct 14 '11 at 20:33
  • Check this out: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and then rewrite this to use DOM operations instead of Regexes. Your broken ` – Marc B Oct 14 '11 at 20:34
  • @Grigor: it's initialized/populated by preg_match – Marc B Oct 14 '11 at 20:34

2 Answers


Well, in a more general sense, your mistake is trying to parse HTML with regexps. As for the snippet you provided, the problem is that the ungreedy modifier only tells *, + and {n,} to stop as soon as they can instead of consuming as much as possible.

So it affects where the match ends, not where it begins: the engine still starts at the leftmost position from which a match is possible. "Ungreedy" does not mean "give me the shortest possible match".

You can more or less fix this particular example by using the mU modifiers instead of sU, so that . doesn't match newlines.
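To make the difference concrete, here is a minimal sketch that runs the question's pattern with and without the s modifier. The variable names ($withS, $withoutS) are only illustrative, and the sketch drops s without adding m, since m only changes how ^ and $ behave and the pattern uses neither.

```php
<?php
// Same input as in the question.
$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = preg_quote('a.mov', '/');

// With "s" the dot also matches newlines, so the engine can start at the
// first "<a" and lazily expand across the line break until it reaches the needle.
preg_match('/<a.*' . $needle . '.*<\/a>/sU', $str, $withS);

// Without "s" the dot stops at the newline, so the attempt starting at
// "<a aaa" fails and the engine moves on to the next "<a".
preg_match('/<a.*' . $needle . '.*<\/a>/U', $str, $withoutS);

var_dump($withS[0]);    // starts at "<a aaa" and spans two lines
var_dump($withoutS[0]); // <a href="a.mov"></a>
```

This only works here because the stray "<a aaa" and the real tag sit on different lines; if both were on one line, the match would again start at the first "<a".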

Fluffy
    +1. "greedy" and "non-greedy" are misnomers. If we called them "eager" and "reluctant" instead, we might prevent some of this confusion. It seems like everybody has to learn this lesson the hard way. (FYI, there's no need to add the `m` modifier; just remove the `s`.) – Alan Moore Oct 15 '11 at 07:21

My array is turning up empty as well. You have to be careful about line breaks when you try to use regex with HTML. There may be an issue with single-line mode (the s modifier).

See: http://www.regular-expressions.info/dot.html

I've successfully parsed HTML with regex, but I wouldn't do it going forward. Look into:

http://simplehtmldom.sourceforge.net/

You will never look back.
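For comparison, here is a rough sketch of the DOM approach. It uses PHP's built-in DOMDocument rather than the Simple HTML DOM library linked above, and it assumes well-formed markup; the $html sample and the $hrefs variable are made up for illustration. How a parser recovers from the broken "<a aaa" in the question is not something this sketch tries to show.

```php
<?php
// Hypothetical, well-formed sample markup (not the broken snippet from the question).
$html = '<div><a href="b.mp4">other</a><a href="a.mov"></a></div>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep the href of every <a> element whose href contains the needle.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, 'a.mov') !== false) {
        $hrefs[] = $href;
    }
}

print_r($hrefs); // one entry: a.mov
```

Because the parser works on elements and attributes, there is no question of where a match starts or how far a quantifier reaches.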

Len