4

Using PHP and preg_match_all I'm trying to get all the HTML content between the following tags (and the tags also):

<p>paragraph text</p>
don't take this
<ul><li>item 1</li><li>item 2</li></ul>
don't take this
<table><tr><td>table content</td></tr></table>

I can get one of them just fine:

preg_match_all("(<p>(.*)</p>)siU", $content, $matches, PREG_SET_ORDER);

Is there a way to get all the <p></p>, <ul></ul> and <table></table> content with a single preg_match_all? I need them to come out in the order they were found so I can echo the content and it will make sense.

So if I did a preg_match_all on the above content then iterated through the $matches array it would echo:

<p>paragraph text</p>
<ul><li>item 1</li><li>item 2</li></ul>
<table><tr><td>table content</td></tr></table>
moinudin
Marcus
  • [Use an XML parser.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – eykanal Dec 27 '10 at 00:31
  • @mario: It's kind of both actually, just a little exaggerated. The important bit is: "Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions." – netcoder Dec 27 '10 at 00:38
  • You should [use an XML parser.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) (go ahead, flag this comment too) –  Dec 27 '10 at 16:21
  • **Don't use regular expressions to parse HTML. Use a proper HTML parsing module.** You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/php for examples of how to properly parse HTML with PHP modules that have already been written, tested and debugged. – Andy Lester Aug 09 '13 at 14:47

4 Answers

11

Use | to match one of a group of strings: p|ul|table

Use a backreference to match the appropriate closing tag: \\1, because (p|ul|table) is the first capturing group. The outermost parentheses in the pattern below are the regex delimiters, not a group, so they are not counted.

Putting that all together:

preg_match_all("(<(p|ul|table)>(.*)</\\1>)siU", $content, $matches, PREG_SET_ORDER);

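This is only going to work if your input HTML follows a very strict structure. The tags cannot contain spaces or attributes. It also fails when a matched tag is nested inside another tag of the same name (a table inside a table, for example), because the lazy match stops at the first closing tag it finds. Consider using an HTML parser to do a proper job.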
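To see the alternation and backreference working together, here is a minimal sketch run against the sample input from the question. It uses # as the delimiter (my choice, not part of the answer above) so that group 1 is unambiguously the tag name, and the $found array is only there for illustration.

```php
<?php
// Sample input copied from the question.
$content = <<<'HTML'
<p>paragraph text</p>
don't take this
<ul><li>item 1</li><li>item 2</li></ul>
don't take this
<table><tr><td>table content</td></tr></table>
HTML;

// Group 1 captures the tag name, group 2 the inner content;
// \1 requires the closing tag to repeat the captured name.
// s: dot also matches newlines, i: case-insensitive, U: quantifiers are lazy.
preg_match_all('#<(p|ul|table)>(.*)</\1>#siU', $content, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each $match is one hit, in the order it was found;
// $match[0] is the full text including the tags.
$found = [];
foreach ($matches as $match) {
    $found[] = $match[0];
    echo $match[0], "\n";
}
```

The same caveats apply: add an attribute or nest a table inside a table and the matches stop being what you expect.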

moinudin
4

This one works for me:

preg_match_all("#<\b(p|ul|table)\b[^>]*>(.*?)</\b(p|ul|table)\b>#si", $content, $matches);
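Note that the pattern above lets any of the three closing tags end a match, whatever the opening tag was. A possible variant, assuming you want the closing tag tied to the opening one, replaces the second group with a backreference. The input string here is made up for illustration.

```php
<?php
// Hypothetical input with attributes on the opening tags.
$content = '<p class="intro">text</p> skip '
         . '<ul id="x"><li>a</li></ul> skip '
         . '<table><tr><td>c</td></tr></table>';

// \b stops "p" from matching tags such as <pre>; [^>]* allows attributes;
// \1 forces the closing tag to use the name captured by group 1.
preg_match_all('#<(p|ul|table)\b[^>]*>(.*?)</\1>#si', $content, $matches);

// Default PREG_PATTERN_ORDER: $matches[0] holds the full matches in the
// order found, $matches[1] the tag names, $matches[2] the inner content.
foreach ($matches[0] as $html) {
    echo $html, "\n";
}
```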
James
1

If you are going to use a DOM parser, and you should, here's how. A contributor posted a useful function for obtaining a DOMNode's innerHTML (called DOMinnerHTML below; it is not built into PHP), which I will use in the following example:

$dom = new DOMDocument;
$dom->loadHTML($html);

$p = $dom->getElementsByTagName('p')->item(0); // first <p> node
$ul = $dom->getElementsByTagName('ul')->item(0); // first <ul> node
$table = $dom->getElementsByTagName('table')->item(0); // first <table> node

echo DOMinnerHTML($p);
echo DOMinnerHTML($ul);
echo DOMinnerHTML($table);
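The example above takes only the first node of each kind and relies on the external DOMinnerHTML helper. If you want every matching element, tags included, in the order they appear, a rough sketch using only built-in classes could look like this. The XPath expression and the $blocks array are my own assumptions, not part of the contributed function.

```php
<?php
// Sample input copied from the question.
$html = <<<'HTML'
<p>paragraph text</p>
don't take this
<ul><li>item 1</li><li>item 2</li></ul>
don't take this
<table><tr><td>table content</td></tr></table>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$blocks = [];
// The union selects all three element types; the resulting node list
// is in practice returned in document order.
foreach ($xpath->query('//p | //ul | //table') as $node) {
    // saveHTML($node) serializes the node together with its own tags.
    $blocks[] = $dom->saveHTML($node);
}

echo implode("\n", $blocks), "\n";
```

The serializer may insert line breaks between block-level children, so the output is not guaranteed to be byte-identical to the input markup.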
netcoder
0

While doable with regular expressions, you could simplify the task by using one of the simpler HTML parser toolkits. For example with phpQuery or QueryPath it's as simple as:

qp($html)->find("p, ul, table")->text();   // or loop over them
Community
mario