php preg_match_all() how to get correct values in match-array

Question

The following situation:

$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);

I need an array ($matches) where one field holds "<span class='classname'>example</span>" and another holds "example". But what I get here is one field with "<span class='classname'>example</span>" and one with "classname".

It should also contain the values for the other matches, of course.

How can I get the right values?

Best advice: forget regexes exist, and switch to using DOM. It'll take you far less time to come up with a nice simple XPath query and a few dom node-extraction calls than it will to get the equivalent regex working - plus you won't beat your brain into a pulp doing so. — Marc B, Aug 27 '12 at 15:25
Die Cthulu, die!! Go back from whence you came... how long... noooo darkness reigns supreme [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) **JUST PARSE THE HTML** — Elias Van Ootegem, Aug 27 '12 at 15:25
[The pony, he comes...](http://stackoverflow.com/a/1732454/1338999) — Matt, Aug 27 '12 at 15:26
I have to agree, there are better ways for parsing HTML (as linked above). However, have you tried dumping your $matches variable? A copy paste of your code and a var_dump, provided me with $matches[3] as an array containing the values you were looking for. — Chris, Aug 27 '12 at 15:29
Just a slight remark: Why would anyone use pipes as regex delimiters?? that's like amputating a limb, IMHO — Elias Van Ootegem, Aug 27 '12 at 15:48

score 0 · Answer 1 · answered Aug 27 '12 at 15:30

The safe/easy way:

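$text = 'blah blah blah';

$dom = new DOMDocument();
$dom->loadHTML($text);

$xp = new DOMXPath($dom);

// Note: this query matches only spans whose class attribute is exactly "classname".
$nodes = $xp->query("//span[@class='classname']");
foreach($nodes as $node) {
    $innertext = $node->nodeValue;
    // Outer HTML of the node; for inner HTML see http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
    $html = $dom->saveHTML($node);
}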
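
Applied to the text from the question, the same approach might look like the sketch below. The XPath expression is an assumption on my part, not part of the answer above: it treats the class attribute as a space-separated list so that the second span ("classname otherclass") is found too. The $query and $results names are only for illustration. Note that the markup returned by saveHTML() is re-serialized by the parser, so details such as attribute quoting may differ from the input.

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to "
      . "demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

$dom = new DOMDocument();
$dom->loadHTML($text);

$xp = new DOMXPath($dom);

// Match "classname" as one token of a space-separated class list.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' classname ')]";

$results = [];
foreach ($xp->query($query) as $node) {
    $results[] = [
        'html' => $dom->saveHTML($node), // the element's markup as serialized by the parser
        'text' => $node->nodeValue,      // the element's text content
    ];
}

print_r($results);
```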

score 0 · Accepted Answer · answered Aug 27 '12 at 15:35

You would be better off with a DOM parser; however, this question has more to do with how capturing works in regexes in general.

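You are getting classname as a match because you are capturing it by putting () around it. Those parentheses are unnecessary, so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that. With the pattern as written, the text you want is the third group, so it is already there in $matches[3]; once the two extra groups are removed it moves to $matches[1].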
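
A minimal sketch of that change, using the text and pattern from the question with only the two extra pairs of parentheses dropped (the string is written on one line here, which should not affect the matches):

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to "
      . "demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

// Same pattern as in the question, minus the parentheses around "classname" and ".+".
preg_match_all('|<[^>/]*classname.+>(.*)</[^>]+>|U', $text, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full matches (the two span elements),
// $matches[1] holds the only remaining capture group: 'example' and 'problem'.
print_r($matches);
```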

If you do need to enclose some part of the pattern in () for grouping rather than capturing, start the group with ?: (that is, write (?:...)) and it won't be captured.
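
To illustrate, here is a sketch with a made-up variation of the question's text in which one of the elements is a div. The alternation span|div needs parentheses for grouping, but with ?: it adds no entry to $matches. The sample text, the ~ delimiter (chosen because | is used for alternation inside the pattern) and the lazy .*? in place of the U modifier are my own choices, not taken from the question.

```php
<?php
// Hypothetical sample text: like the question's, but the second element is a div.
$text = "Some <span class='classname'>example</span> text and a "
      . "<div class='classname otherclass'>problem</div> here.<br />";

// (?:span|div) groups the alternation without capturing it,
// so (.*?) is still capture group number 1.
$pattern = '~<(?:span|div)[^>]*classname[^>]*>(.*?)</(?:span|div)>~';

preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER);

// $matches has two entries: [0] the full matches, [1] the captured inner text.
print_r($matches);
```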
