I am trying to come up with a regex to grab all the text between two HTML tags. This is what I have so far:

<TAG[^>]*>(.*?)</TAG>

In practice, this should work perfectly. But running it through PHP's preg_replace with the /ims modifiers results in the WHOLE string getting matched.

If I remove the /s modifier, it works perfectly, but the tags have newlines between them. Is there a better way to approach this?

Andy Lester
curiousgeorge
  • "In practice, this should work perfectly." This is exactly why you shouldn't use regular expressions to parse HTML, because everything works perfectly until you try to actually use it. Use a DOM parser instead. – CanSpice Mar 24 '11 at 18:19
  • You cannot reliably parse HTML with regular expressions. They are not up to the task. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. – Andy Lester Dec 21 '12 at 03:44

2 Answers

Of course there's a better way. Don't parse HTML with regex.

DOMDocument should be able to accommodate you better:

$dom = new DOMDocument();
$dom->loadHTMLFile('filename.html');

$tags = $dom->getElementsByTagName('tag');

echo $tags->item(0)->textContent; // Contents of the first `tag` (item() also works on PHP versions without array access on DOMNodeList)

You may have to tweak the above code (hasn't been tested).
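For completeness, here is a minimal sketch of the same idea that reads from a string instead of a file and collects every match. The `$html` value and the `tag` element name are placeholders, not anything from the question, and the libxml error handling is an assumption about how you might want to silence warnings for non-standard element names.

```php
<?php
// Hypothetical input; <tag> is a placeholder element name.
$html = "<div><tag class=\"a\">first\nline</tag><tag>second</tag></div>";

$dom = new DOMDocument();

// Non-standard element names make libxml emit warnings;
// collect them internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Gather the text of every <tag> element, newlines included.
$contents = [];
foreach ($dom->getElementsByTagName('tag') as $node) {
    $contents[] = $node->textContent;
}

print_r($contents);
```

Because the parser handles attributes and line breaks itself, no modifier juggling is needed here.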

Community

I don't recommend using regex to match against full HTML, but you can use the "dotall" flag, which lets . match newlines as well: /REGEXP/s

Example:

$str = "<tag>
fvox
</tag>";

// '#' is used as the delimiter because '/' appears in the closing tag
preg_match_all('#<TAG[^>]*>(.*?)</TAG>#is', $str, $r);
print_r($r); // dump the matches
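
To see what the flag actually changes, here is a small sketch that runs the same pattern with and without /s on a string containing newlines. The variable names are made up for illustration, and `#` is used as the delimiter only so that the `/` in the closing tag does not need escaping.

```php
<?php
$str = "<tag>\nfvox\n</tag>";

// Without /s, '.' stops at newlines, so the pattern cannot reach </tag>.
$withoutS = preg_match('#<tag[^>]*>(.*?)</tag>#i', $str, $m1);

// With /s, '.' also matches newlines and the lazy group spans the lines.
$withS = preg_match('#<tag[^>]*>(.*?)</tag>#is', $str, $m2);

var_dump($withoutS); // int(0)
var_dump($withS);    // int(1)
echo trim($m2[1]);   // fvox
```

The captured group keeps the surrounding newlines, hence the trim() before printing.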
fvox