1

I have a huge HTML document that I need to parse. The document is a list of <p> elements, all direct children of the body tag. They differ only in their class names. The structure is like this:

    <p class="first-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>
    <p class="third-level"></p>
    <p class="nth-levels just-for-demo-1"></p>
    <p class="nth-levels just-for-demo-1"></p>
    <p class="third-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>
    <p class="nth-levels just-for-demo-2"></p>
    <p class="first-level"></p>
    <p class="second-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>

And so on. nth-level can be any class name that isn't first-level, second-level or third-level. Basically it's a multi-level <ul> element very poorly marked-up.

What I want to do is parse it and obtain all <p> elements (the whole element including its tag, not just the innerHTML) that sit between paragraphs carrying one of the class names above.

In the example above, I want to get:

    <p class="nth-levels just-for-demo-1"></p>
    <p class="nth-levels just-for-demo-1"></p>

and

    <p class="nth-levels just-for-demo-2"></p>

How the heck can I do that please? Thank you.

Francisc

4 Answers

2

Using XPath:

    //p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]

to get the (non?)matching nodes, then you can use this answer to get the outerHTML of the nodes.
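
A minimal sketch of how that query could be run from PHP, assuming the markup survives `DOMDocument::loadHTML()`. libxml's HTML parser often copes with invalid markup, and `libxml_use_internal_errors(true)` keeps its warnings from being printed. The sample string and the `$matches` array are illustrative and not part of the question; passing a node to `saveHTML()` (PHP 5.3.6 or later) returns that node including its own tag.

```php
<?php
// Shortened sample of the markup from the question; a real run would load the full file.
$html = <<<'HTML'
<html><body>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query(
    "//p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]"
);

$matches = [];
foreach ($nodes as $node) {
    // With a node argument, saveHTML() serializes only that node, tag included.
    $matches[] = $doc->saveHTML($node);
}

echo implode("\n", $matches), "\n";
```

Note that `@class='first-level'` compares the whole attribute value, so a paragraph with `class="first-level extra"` would also be returned by this query.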

Marc B
  • Hm, that's clever, but doesn't XPath need valid XML? The code I'm looking at (which is horrific) is HTML4 and invalid XML. – Francisc Aug 31 '11 at 19:15
  • If you can run that 'garbage' through Tidy or Purifier without most of it getting lopped off as cancerous, then you can feed it to DOM and XPath. But otherwise, you'll have to use something else. DOM is extraordinarily picky about the html it'll accept. – Marc B Aug 31 '11 at 19:16
  • Good idea, I'll run it through that and do a damage assessment. Thanks (again now that I saw the name). – Francisc Aug 31 '11 at 19:21
1

Additionally, if you're familiar with jQuery, try a jQuery port to PHP. It gives you the same powerful tools for matching a set of elements in a document (selectors) that you're used to from jQuery, along with hierarchy selectors, attribute filters, child filters, etc. Reference

toopay
  • Haha, there's a port from anything to anything nowadays. I think this will work just fine. I'll give it a go. – Francisc Aug 31 '11 at 19:25
0
    $doc = new DOMDocument;
    $doc->loadHTML(...);
    $query = '//p[contains(@class, "just-for-demo-")]';
    $xpath = new DOMXPath($doc);
    $entries = $xpath->query($query);

    foreach ($entries as $entry)
    {
      // not a best solution yet: rebuild the opening tag from the node's attributes
      $attribute = '';
      foreach ($entry->attributes as $attr)
      {
        // the leading space keeps the tag name and each attribute apart
        $attribute .= " {$attr->name}=\"{$attr->value}\"";
      }

      echo "<{$entry->nodeName}{$attribute}>{$entry->nodeValue}</{$entry->nodeName}>";
    }
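
The loop above prints every match in one flat stream, while the question lists the wanted paragraphs as separate runs. One possible way to keep the runs apart, sketched here under the assumption that every `<p>` is a direct child of `<body>` as the question states, is to walk the body's children in order and close a run whenever one of the three known level classes appears. The names `$levels`, `$groups` and `$current` are illustrative, and `saveHTML()` with a node argument (PHP 5.3.6 or later) stands in for the hand-built tag.

```php
<?php
// Sample markup copied from the question.
$html = <<<'HTML'
<html><body>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$levels  = ['first-level', 'second-level', 'third-level'];
$groups  = [];   // finished runs of "other" paragraphs
$current = [];   // the run currently being collected

$body = $doc->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
    if (!($child instanceof DOMElement) || $child->nodeName !== 'p') {
        continue; // skip whitespace text nodes and anything that is not a <p>
    }
    if (in_array($child->getAttribute('class'), $levels, true)) {
        // A known level ends the current run, if one is open.
        if ($current) {
            $groups[] = $current;
            $current = [];
        }
    } else {
        $current[] = $doc->saveHTML($child); // outer HTML of the paragraph
    }
}
if ($current) {
    $groups[] = $current; // a run that reaches the end of the document
}

print_r($groups);
```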
ajreal
  • Hey, thanks, but `just-for-demo-` was... well... just for demo... It doesn't need to be very very elegant. It will only run once. – Francisc Aug 31 '11 at 19:17
-1

You could open the file (with fopen or something similar) and read one line at a time. Then just check whether the required string is in the line (for example with strstr) and, if it is, add the line to an array or do whatever you need with it. Note: this only works if each paragraph is on its own line.
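
A rough sketch of that approach, assuming one complete `<p>` element per line. Because the wanted class names are not known in advance, the check is inverted here: a line is kept when it contains a `<p` tag but none of the three known level classes. The in-memory stream stands in for the real file, and `$levels` and `$kept` are illustrative names.

```php
<?php
// Stand-in for the real file: a few sample lines written to an in-memory stream.
$handle = fopen('php://memory', 'w+');
fwrite($handle, implode("\n", [
    '<p class="first-level"></p>',
    '<p class="second-level"></p>',
    '<p class="nth-levels just-for-demo-1"></p>',
    '<p class="third-level"></p>',
    '<p class="nth-levels just-for-demo-2"></p>',
]) . "\n");
rewind($handle);

$levels = ['first-level', 'second-level', 'third-level'];
$kept   = [];

while (($line = fgets($handle)) !== false) {
    if (strstr($line, '<p') === false) {
        continue; // not a paragraph line
    }
    $isLevel = false;
    foreach ($levels as $level) {
        if (strstr($line, 'class="' . $level . '"') !== false) {
            $isLevel = true; // one of the known level classes
            break;
        }
    }
    if (!$isLevel) {
        $kept[] = trim($line); // keep the whole line, tag included
    }
}
fclose($handle);

print_r($kept);
```

For a real file, `fopen('php://memory', 'w+')` and the `fwrite()` call would be replaced by opening the file for reading.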

fopen documentation

strstr documentation

Eduard Luca
  • Thanks. I'm not sure the paragraphs are on a single line. And I think an HTML parser is better for this. – Francisc Aug 31 '11 at 19:11