
If I have a block of HTML and want the exact HTML for certain nodes and their children, for example the <ul> block below, should I use something like preg_match, or parse the content with something like DOM parsing?

Input

<html>
<head>
</head>
<body>
<h2>List</h2>
<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>
</body>
</html>

Desired output

<ul class="my-list" id="my-list">
    <li class="item first">item1</li>
    <li class="item second">item2</li>
    <li class="item third">item3</li>
</ul>

As you can see I want to preserve all the attributes (classes, ids, etc).

I know that with DOM parsing I can access all of those attributes ($items->item($i)->getAttribute('class')), but can DOM easily (and automatically) rebuild just a section of the original markup, without my having to loop through the nodes and build the HTML manually? (I know DOM has echo $DOM->saveXML(), but I believe that is just for the entire page.)

I know how I can accomplish this with regex and PHP fairly easily, but I suspect that is not good practice.

This is so simple with jQuery:

jQuery('ul').clone()

How can I achieve the same thing with PHP? (Grabbing remote HTML, taking a slice of it with DOM, and outputting it as HTML again.)

cwd
  • If your HTML is simple and predictable, there's nothing wrong in using regex - see [this answer](http://stackoverflow.com/a/4231482/825789) (and [the treatise below it](http://stackoverflow.com/a/4234491/825789)). – bfavaretto Apr 28 '12 at 03:24
  • @bfavaretto - yes but when is html ever simple and predictable? – pguardiario Apr 29 '12 at 09:03

3 Answers


It's not that bad with the DOM functions, though maybe a bit more verbose than it should be:

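$dom = new DOMDocument();
# @ suppresses the warnings libxml emits for imperfect real-world markup
@$dom->loadHTML($html);
# or 
# @$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
# saveXML() accepts a node; given one, it serializes only that node and its children
echo $dom->saveXML($xpath->query("//ul")->item(0));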
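
Since the goal is HTML rather than XML output, a variant that may be worth trying is saveHTML() with a node argument, which as far as I know needs PHP 5.3.6 or later. This is only a sketch: the inline $html string is a shortened stand-in for the page in the question, and libxml_use_internal_errors() is used here in place of @ purely as a matter of taste.

```php
<?php
// Stand-in for the page from the question (shortened to two items).
$html = '<html><head></head><body><h2>List</h2>'
      . '<ul class="my-list" id="my-list">'
      . '<li class="item first">item1</li>'
      . '<li class="item second">item2</li>'
      . '</ul></body></html>';

$dom = new DOMDocument();

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$ul = $xpath->query('//ul')->item(0);

// With a node argument, saveHTML() serializes only that subtree as HTML.
$fragment = $dom->saveHTML($ul);
echo $fragment, "\n";
```

The attributes on the <ul> and on each <li> come back as they were parsed, so nothing has to be rebuilt by hand.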
pguardiario

I suggest using DOM parsing, because it is more maintainable if the HTML structure changes, and the code is easier to read than a regexp.
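
To illustrate the maintainability point, here is a rough sketch that selects the list by its id, so it keeps working when the surrounding structure changes (the sample markup wraps the list in an extra div). The helper name outerHtmlById is made up for this example, the nullable return type assumes PHP 7.1 or later, and the XPath string is built naively, so it assumes the id contains no double quotes.

```php
<?php
// Hypothetical helper: returns the HTML of the element with the given id,
// including the element itself and its attributes, or null if it is not found.
function outerHtmlById(string $html, string $id): ?string
{
    $dom = new DOMDocument();

    // Keep libxml quiet about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Assumption: $id contains no double quotes.
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Serialize only the matched node and its children.
    return $dom->saveHTML($nodes->item(0));
}

// The list from the question, moved inside an extra wrapper element.
$html = '<body><div class="wrapper"><h2>List</h2>'
      . '<ul class="my-list" id="my-list"><li class="item first">item1</li></ul>'
      . '</div></body>';

$list = outerHtmlById($html, 'my-list');
echo $list, "\n";
```

A regexp written against the original layout might need adjusting after a change like this, whereas the id lookup does not care where the list sits in the tree.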

Mārtiņš Briedis
  • I know I didn't ask for it explicitly in the original question, but could you provide an example of how to grab the `ul` block using DOM and preserving the node attributes? – cwd Apr 28 '12 at 15:58

It depends on how much you trust the data source. Is it going to be consistent? Could there be errors in the markup? Do you know what to expect?

If it's as simple as your sample, or relatively close to it, I see no reason regex isn't a perfectly valid choice here.

It gets more difficult if, for example, there are multiple <ul>s. As long as something uniquely identifies the one you want, or they always appear in the same order, it shouldn't be a problem though.
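
As a rough illustration of the "uniquely identifying" case, the sketch below uses preg_match to pick out the list by its id when another <ul> comes first. The pattern is only an example and makes several assumptions: the id is double-quoted, the tag name is lowercase, and the target list contains no nested <ul>. If any of those fail, it will match the wrong span or nothing at all.

```php
<?php
// Two lists; only the second carries the id we are after.
$html = '<ul class="other"><li>skip</li></ul>'
      . '<ul class="my-list" id="my-list"><li class="item first">item1</li></ul>';

// Opening <ul> tag that contains id="my-list", then everything up to the
// first closing </ul>. The s flag lets . match newlines as well.
$pattern = '~<ul\b[^>]*\bid="my-list"[^>]*>.*?</ul>~s';

$found = preg_match($pattern, $html, $matches);
$list = $found === 1 ? $matches[0] : null;

echo $list, "\n";
```

Because [^>]* cannot cross a >, the id has to sit inside the same opening tag, which is what keeps the first list out of the match.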

dtbarne