I'm using DOMXPath to get the content of specific nodes. For my problem, I want to get all the text of the matching divs except that of nested divs.

$html = 
'<div itemscope="itemscope" itemtype="http://schema.org/Event">
  <span itemprop="name"> Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)</span>
  <meta itemprop="startDate" content="2016-04-21">
    Thu, 04/21/16
    8:00 p.m    
  <div itemprop="offers" itemscope="itemscope" itemtype="http://schema.org/AggregateOffer">
    Priced from: <span itemprop="lowPrice">$35</span>
    <span itemprop="offerCount">1938</span> tickets left
  </div>
  <meta itemprop="endDate" content="2020-3-2"> end date of year    
  <div itemprop="attendee" itemscope="itemscope" itemtype="http://schema.org/Person">
     <span itemprop="name">Jane Doe</span>
     <meta itemprop="birthDate" content="1975-05-06"> 
    <div itemprop="sibling" itemscope="itemscope" itemtype="http://schema.org/Person">
        <span itemprop="name">Fatima Zohra</span>
        <meta itemprop="birthDate" content="1991-6-5">Jan 6
     </div>      
  </div>
</div>';

I first tried the following but this did not return the nested divs:

$tags = $xpath->query("//div[@itemscope='itemscope'][not(self::div)]/text()");

My current attempt is the following, but it does not work:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[not(ancestor::div)]');

foreach ($tags as $node) {
    echo $node->nodeValue; // body

}
hakre
Fatima Zohra

2 Answers

This problem could best be split into two parts:

  1. Return a list of matching divs
  2. Print all content of each div EXCEPT the content of the divs nested inside it

The following demonstrates this approach:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$divs = $xpath->query("//div[@itemscope='itemscope']");

foreach ($divs as $div) {
        $nodelist = $xpath->query('child::node()[not(self::div)][normalize-space()]',$div);

        foreach ($nodelist as $node) {
                echo $node->nodeValue . "\n";
        }
        echo "\n---------------------\n";
}

Note the following:

  • 'child::node()' instead of '*' includes text nodes, not just elements
  • '[normalize-space()]' filters out nodes whose text is empty or whitespace only, such as the newlines between tags

As an aside, 'not(ancestor::div)' selects only divs that have no div ancestor, so the attempt in the question matches the outermost div and skips every nested one.
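To make the difference between the two queries concrete, here is a minimal sketch. The markup is a shortened, hypothetical stand-in for the HTML in the question (the `id` attributes are added only to label the output), and `ownText()` is an illustrative helper name, not part of DOMXPath.

```php
<?php
// Hypothetical, shortened input; not the markup from the question.
$html = '<div itemscope="itemscope" id="outer">
  <span>Outer name</span>
  Outer text
  <div itemscope="itemscope" id="inner">
    <span>Inner name</span>
  </div>
</div>';

// Hypothetical helper: the text of a div without the text of its child divs.
function ownText(DOMXPath $xpath, DOMNode $div): array
{
    $parts = [];
    $nodes = $xpath->query('child::node()[not(self::div)][normalize-space()]', $div);
    foreach ($nodes as $node) {
        $parts[] = trim($node->nodeValue);
    }
    return $parts;
}

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Every itemscope div, nested or not.
$all = $xpath->query("//div[@itemscope='itemscope']");
// Only divs that have no div ancestor, i.e. the outermost one.
$top = $xpath->query('//div[not(ancestor::div)]');

foreach ($all as $div) {
    echo $div->getAttribute('id'), ': ', implode(' | ', ownText($xpath, $div)), "\n";
}
```

With this input, `$all` holds two divs and `$top` only one. The loop should print `outer: Outer name | Outer text` followed by `inner: Inner name`, so the text of each div is reported once, under the div that directly contains it.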

Mauritz Hansen

The microdata you're looking for is carried by the itemprop, itemscope, itemtype and content attributes.

So your question is actually about how to obtain the microdata from that HTML document, which is basically a question of XML parsing. As the schema.org microdata is more or less straightforward, I highly suggest using DOMDocument to load the HTML document but SimpleXML to parse the data.

The libxml-based PHP XML extensions cannot do this parsing with XPath alone, because the library supports only XPath 1.0 and not everything can be expressed in that version. In particular, this scenario needs to select the descendant-or-self nodes with a specific attribute, relative to a context node, that do not in turn contain children with that attribute. That always requires some wrapping code. If you're interested in reading more about that, I found the following question, which revolves around an XPath problem similar to yours:

So instead, wrap the XPath code inside a class and access the data you are interested in directly:

$dom = new DOMDocument;
$dom->loadHTML($html);

$micro = new Micro($dom);
$event = $micro->Event;

foreach($event as $name => $value) {
    if ($value->isEmbed()) continue;
    printf("%s => %s\n", $name, $value);
}

Gives the following output:

name =>  Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)
startDate => 2016-04-21
endDate => 2020-3-2

Or you just access:

$micro->Event->name; # Miami Heat at Philadelphia 76ers - Game 3 (Home Game 1)

The Micro microdata class is available as a gist.
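The Micro class itself is not reproduced here. As a rough sketch of the kind of wrapping code described above, and not the gist's implementation, the following hypothetical `itemProperties()` helper combines two XPath 1.0 queries with a PHP check to keep only the properties that belong directly to one item. The sample markup and the function name are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not the Micro class from the answer: collect the
// itemprop values that belong directly to one itemscope element.
function itemProperties(DOMXPath $xpath, DOMElement $scope): array
{
    $props = [];
    foreach ($xpath->query('.//*[@itemprop]', $scope) as $el) {
        // Nearest ancestor carrying itemscope; skip properties of nested items.
        $owner = $xpath->query('ancestor::*[@itemscope][1]', $el)->item(0);
        if ($owner === null || !$owner->isSameNode($scope)) {
            continue;
        }
        // An element that is itself an item is embedded; skip it as well.
        if ($el->hasAttribute('itemscope')) {
            continue;
        }
        // meta elements carry their value in the content attribute.
        $props[$el->getAttribute('itemprop')] = $el->hasAttribute('content')
            ? $el->getAttribute('content')
            : trim($el->textContent);
    }
    return $props;
}

// Hypothetical, shortened input modelled on the markup in the question.
$html = '<div itemscope="itemscope" itemtype="http://schema.org/Event">
  <span itemprop="name">Sample Event</span>
  <meta itemprop="startDate" content="2016-04-21">
  <div itemprop="offers" itemscope="itemscope" itemtype="http://schema.org/AggregateOffer">
    <span itemprop="lowPrice">$35</span>
  </div>
</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$event = $xpath->query("//*[@itemscope][@itemtype='http://schema.org/Event']")->item(0);
print_r(itemProperties($xpath, $event));
```

For this input the result should contain only `name` and `startDate`: the `offers` div is skipped because it is an embedded item, and `lowPrice` is skipped because its nearest itemscope ancestor is the offer rather than the event.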

Community
hakre