PHP Regex dot matches new line alternative

Question

I have come up with a regex to grab all text between two HTML tags. This is what I have so far:

<TAG[^>]*>(.*?)</TAG>

In practice, this should work perfectly. But running it through PHP's preg_replace with the modifiers /ims results in the WHOLE string getting matched.

If I remove the /s modifier it works perfectly, but I need it because the tags have newlines between them. Is there a better way to approach this?

"In practice, this should work perfectly." This is exactly why you shouldn't use regular expressions to parse HTML, because everything works perfectly until you try to actually use it. Use a DOM parser instead. — CanSpice, Mar 24 '11 at 18:19
You cannot reliably parse HTML with regular expressions. They are not up to the task. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. — Andy Lester, Dec 21 '12 at 03:44

score 3 · Accepted Answer · edited May 23 '17 at 12:11

Of course there's a better way. Don't parse HTML with regex.

DOMDocument should be able to accommodate you better:

$dom = new DOMDocument();
$dom->loadHTMLFile('filename.html');

$tags = $dom->getElementsByTagName('tag');

echo $tags->item(0)->textContent; // Contents of the first `tag` element

You may have to tweak the above code (hasn't been tested).
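As a minimal sketch of the same idea on an in-memory string rather than a file: the sample markup in $html is made up for illustration, and `tag` is just the placeholder element name from the question. libxml reports non-standard element names as errors, so the sketch collects those internally instead of printing warnings.

```php
<?php
// Hypothetical sample markup; <tag> is the placeholder element from the question.
$html = "<div><tag class=\"a\">first\nline</tag><tag>second</tag></div>";

// Collect libxml errors internally: unknown elements such as <tag> are reported as errors.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the text of every <tag> element, newlines included.
$contents = [];
foreach ($dom->getElementsByTagName('tag') as $node) {
    $contents[] = $node->textContent;
}

print_r($contents);
```

Because the parser builds a tree first, newlines inside the element need no special handling here.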

Community

answered Mar 24 '11 at 18:19

Thank you for this. Going to give DOM a try! – curiousgeorge Mar 24 '11 at 18:47

score 1 · Answer 2 · answered Apr 12 '11 at 18:16

I don't recommend using regex to match full HTML, but you can use the "dotall" modifier (PCRE_DOTALL): /REGEXP/s

Example:

$str = "<tag>
fvox
</tag>";

// Use ~ as the delimiter so the / in </TAG> does not end the pattern early
preg_match_all('~<TAG[^>]*>(.*?)</TAG>~is', $str, $r);
print_r($r); // dump the matches
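To show what the s modifier changes, and one alternative that avoids it, here is a rough sketch using the same sample string. It uses ~ as the pattern delimiter so the / in the closing tag needs no escaping. The variable names are made up for illustration. The character class [\s\S] matches any character, newlines included, regardless of the s modifier.

```php
<?php
$str = "<tag>\nfvox\n</tag>";

// Without /s the dot does not match "\n", so the pattern cannot span lines.
$plain = preg_match('~<tag[^>]*>(.*?)</tag>~i', $str, $m1);

// With /s (PCRE_DOTALL) the dot matches newlines as well.
$dotall = preg_match('~<tag[^>]*>(.*?)</tag>~is', $str, $m2);

// Alternative without /s: [\s\S] matches whitespace or non-whitespace, i.e. any character.
$anyChar = preg_match('~<tag[^>]*>([\s\S]*?)</tag>~i', $str, $m3);

var_dump($plain, $dotall, $anyChar);
```

On this input the first call finds nothing, while the other two capture the same text between the tags.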

fvox
