PHP preg_match() ungreedy match issue

Question

I use the PHP pattern modifier "U" to invert the default greedy behavior of preg_match(). However, it doesn't work the way I expect. My code:

$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = "a.mov";

$pattern = "/\<a.*".preg_quote($needle, "/").".*\<\/a\>/sU";

preg_match($pattern, $str, $matches);
print_r($matches);

I'm trying to match on

<a href="a.mov"></a>

But this code returns

<a aaa
    <a href="a.mov"></a>

Can someone shed some light on what I did wrong?

Your $matches variable doesn't equal anything, does it? How do you print it when it's not initialized? — Grigor, Oct 14 '11 at 20:33
Check this out: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and then rewrite this to use DOM operations instead of Regexes. Your broken ` — Marc B, Oct 14 '11 at 20:34

score 2 · Answer 1 · answered Oct 14 '11 at 20:53

Well, in a more general sense, your mistake was trying to parse HTML with regexps. As for the snippet you provided, the problem is that the ungreedy modifier tells *, + and {n,} to stop as soon as they are satisfied instead of consuming as much as they can.

So it essentially affects where the match ends, not where it begins. The engine still starts at the leftmost position where a match is possible, so "ungreedy" does not mean "give me the shortest possible match".
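To make the point above concrete, here is a minimal sketch. The sample strings are made up for illustration and are not from the question. It compares a greedy and an ungreedy match on the same input, then shows that the ungreedy match still begins at the first possible position.

```php
<?php
// Greedy vs. ungreedy on the same subject: only the END of the match moves.
$subject = '<a 1> <a 2>';
preg_match('/<a.*>/', $subject, $greedy);  // runs to the last ">"
preg_match('/<a.*>/U', $subject, $lazy);   // stops at the first ">"

// The START is still the leftmost "<a", even though a shorter
// candidate ("<a 2>") exists further to the right.
$nested = '<a 1 <a 2>';
preg_match('/<a.*>/U', $nested, $leftmost);

print_r([$greedy[0], $lazy[0], $leftmost[0]]);
// $greedy[0]   is '<a 1> <a 2>'
// $lazy[0]     is '<a 1>'
// $leftmost[0] is '<a 1 <a 2>'
```

The last case mirrors the question: the match starts at the earlier, unclosed "<a" because a match is possible from there.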

You can more or less fix this particular example by using the mU modifiers instead of sU, so that . doesn't match newlines.
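As a rough check of that suggestion, the sketch below runs the question's input through the pattern with and without the s modifier. The variable names $quoted, $withS and $withoutS are only for illustration, and the unnecessary backslashes before < and > have been dropped.

```php
<?php
// Same input as in the question, including the unclosed "<a aaa" fragment.
$str = '<p>
<div><a aaa
    <a href="a.mov"></a>
  </div>
</p>';

$needle = 'a.mov';
$quoted = preg_quote($needle, '/');

// With "s", "." also matches line breaks, so a match is possible
// starting from the first "<a" (the broken "<a aaa").
preg_match('/<a.*' . $quoted . '.*<\/a>/sU', $str, $withS);

// Without "s", "." stops at the line break, so the attempt starting at
// "<a aaa" fails and the engine moves on to the next "<a".
preg_match('/<a.*' . $quoted . '.*<\/a>/U', $str, $withoutS);

print_r([$withS[0], $withoutS[0]]);
// $withoutS[0] is '<a href="a.mov"></a>'
```

This only works here because the target tag sits on a single line; a tag that spans several lines would no longer match.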

Fluffy


+1. "greedy" and "non-greedy" are misnomers. If we called them "eager" and "reluctant" instead, we might prevent some of this confusion. It seems like everybody has to learn this lesson the hard way. (FYI, there's no need to add the `m` modifier; just remove the `s`.) – Alan Moore Oct 15 '11 at 07:21

score 0 · Answer 2 · answered Oct 14 '11 at 20:44

My array is turning up empty as well. You have to be careful about line breaks when you try to use regex with HTML. There may be an issue with single-line mode (the s modifier), which lets . match line breaks.

See: http://www.regular-expressions.info/dot.html

I've successfully parsed HTML with regex but I wouldn't do it going forward. Look into

http://simplehtmldom.sourceforge.net/

You will never look back.
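For comparison, here is a minimal sketch of the parser-based approach using PHP's built-in DOMDocument rather than the Simple HTML DOM library linked above. The markup in $html is a well-formed sample invented for illustration, since the HTML in the question is malformed and how a parser recovers from it is not something to rely on.

```php
<?php
// Well-formed sample markup (an assumption, not the question's input).
$html = '<div><a href="b.mov">other</a><a href="a.mov">target</a></div>';
$needle = 'a.mov';

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$hits = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    // Keep only links whose href contains the needle.
    if (strpos($link->getAttribute('href'), $needle) !== false) {
        $hits[] = $link->getAttribute('href');
    }
}

print_r($hits);
```

Because the parser hands back whole elements, there is no need to reason about where a match starts or ends.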
