Getting DOM elements of html from file_get_contents

Question

I am fetching HTML from a website with file_get_contents. The HTML contains a table (with a class name), and I want to get the data inside its tags.

This is how I fetch the html data from url:

$url = 'http://example.com';
$content = file_get_contents($url);

The html looks like:

<table class="space">
   <thead></thead>
   <tbody>
      <tr>
         <td class="marsia">1</td>
         <td class="mars">
           <div>Mars</div>
         </td>
      </tr>
      <tr>
         <td class="earthia">2</td>
         <td class="earth">
           <div>Earth</div>
         </td>
      </tr>
   </tbody>
</table>

Is there a way to search DOM elements in PHP like we do in jQuery, so that I can access the values 1 and 2 (first td) and the div's value inside the second td?

Something like

a) search the html for table with class name space

b) inside that table, inside tbody, return each tr's 'first td's value' and 'div's value inside second td'

So I get: 1 and Mars, 2 and Earth.

Use [DOMDocument](http://php.net/manual/en/class.domdocument.php) to parse the HTML. — Barmar, Dec 17 '16 at 10:53

score 1 · Answer 1 · answered Dec 18 '16 at 12:10

Use the DOM extension, for example. Its DOMXPath class is particularly useful for this kind of task.

You can easily set the listed conditions with an XPath expression like this:

//table[@class="space"]//tr[count(td) = 2]/td

where

- //table[@class="space"] selects all table elements in the document whose class attribute value is equal to the string "space";
- //tr[count(td) = 2] selects all tr elements having exactly two td child elements;
- /td represents the td elements.
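Note that @class="space" compares the whole attribute value, so a table with several classes would not match. A minimal sketch of a common XPath 1.0 idiom for matching a single class token follows; the markup with the extra "wide" class is a hypothetical example, not part of the question.

```php
<?php
// Hypothetical markup: the table carries two classes, not just "space".
$html = '<table class="space wide"><tr><td>1</td><td>Mars</td></tr></table>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison of the whole attribute value: does not match "space wide".
$exact = $xpath->query('//table[@class="space"]');

// Token match: pad the normalized class list with spaces and look for " space ".
$token = $xpath->query(
    '//table[contains(concat(" ", normalize-space(@class), " "), " space ")]'
);

printf("exact: %d, token: %d\n", $exact->length, $token->length);
```

With this input the exact comparison finds no table, while the token match finds one.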

Sample implementation:

$html = <<<'HTML'
<table class="space">
   <thead></thead>
   <tbody>
      <tr>
         <td class="marsia">1</td>
         <td class="mars">
           <div>Mars</div>
         </td>
      </tr>
      <tr>
         <td class="earthia">2</td>
         <td class="earth">
           <div>Earth</div>
         </td>
      </tr>
      <tr>
         <td class="earthia">3</td>
      </tr>
   </tbody>
</table>
HTML;

$doc = new DOMDocument;
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

$cells = $xpath->query('//table[@class="space"]//tr[count(td) = 2]/td');

$i = 0;
foreach ($cells as $td) {
  if (++$i % 2) {
    $number = $td->nodeValue;
  } else {
    $planet = trim($td->textContent);
    printf("%d: %s\n", $number, $planet);
  }
}

Output

1: Mars
2: Earth

Treat the code above as a sample rather than something to use in practice, as it does not scale well: the logic depends on the XPath expression selecting exactly two cells for each row. In practice, you may want to select the rows, iterate over them, and put the extra conditions into the loop, e.g.:

$rows = $xpath->query('//table[@class="space"]//tr');

foreach ($rows as $tr) {
  $cells = $xpath->query('.//td', $tr);

  if ($cells->length < 2) {
    continue;
  }

  $number = $cells[0]->nodeValue;
  $planet = trim($cells[1]->textContent);
  printf("%d: %s\n", $number, $planet);
}

DOMXPath::query() is called with an XPath expression relative to the current row ($tr); the loop then skips the row unless the returned DOMNodeList contains at least two cells. The rest of the code is trivial.
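To tie the row-based loop back to the question's file_get_contents call, here is a rough sketch that wraps it in a function. The name extractPlanets is hypothetical, the type declarations assume PHP 7 or later, and silencing libxml warnings is an assumption about how you want to handle the invalid markup that real pages often contain.

```php
<?php
/**
 * Hypothetical helper: returns [number, planet] pairs from rows of table.space.
 */
function extractPlanets(string $html): array
{
    // Collect parser warnings internally instead of emitting them,
    // then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];

    foreach ($xpath->query('//table[@class="space"]//tr') as $tr) {
        // Only the direct td children of this row.
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 2) {
            continue;
        }
        $pairs[] = [
            (int) trim($cells->item(0)->textContent),
            trim($cells->item(1)->textContent),
        ];
    }

    return $pairs;
}

// Usage with the question's fetch; file_get_contents() returns false on failure:
// $content = file_get_contents('http://example.com');
// if ($content !== false) { print_r(extractPlanets($content)); }

// Inline sample equivalent to the markup used above.
$sample = '<table class="space"><tbody>'
    . '<tr><td class="marsia">1</td><td class="mars"><div>Mars</div></td></tr>'
    . '<tr><td class="earthia">2</td><td class="earth"><div>Earth</div></td></tr>'
    . '<tr><td class="earthia">3</td></tr>'
    . '</tbody></table>';

print_r(extractPlanets($sample));
```

Returning the pairs instead of printing them inside the loop keeps the parsing separate from the output, which makes the function easier to reuse.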

You can also use the SimpleXML extension, which supports XPath as well, but it is much less flexible than the DOM extension.
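A minimal SimpleXML sketch, for comparison. SimpleXML expects well-formed XML, so this version assumes you still parse the HTML with DOMDocument and convert the result with simplexml_import_dom(); the compact sample markup is made up for the example.

```php
<?php
$html = '<table class="space"><tbody>'
    . '<tr><td>1</td><td><div>Mars</div></td></tr>'
    . '<tr><td>2</td><td><div>Earth</div></td></tr>'
    . '<tr><td>3</td></tr>'
    . '</tbody></table>';

// Parse as HTML with DOM first, then hand the tree over to SimpleXML.
$doc = new DOMDocument;
$doc->loadHTML($html);
$xml = simplexml_import_dom($doc);

$result = [];
// SimpleXMLElement::xpath() returns a plain array of matching elements.
foreach ($xml->xpath('//table[@class="space"]//tr[count(td) = 2]') as $tr) {
    $result[] = sprintf('%d: %s', (int) $tr->td[0], trim((string) $tr->td[1]->div));
}

echo implode("\n", $result), "\n";
```

The property-style access ($tr->td[1]->div) is convenient for a fixed structure like this one, but it is tied to the exact nesting of the markup.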

For huge documents, use a streaming parser such as XMLReader (a pull parser), which does not load the whole document into memory.
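A rough sketch of the XMLReader approach, assuming the input is well-formed XML (for example XHTML); XMLReader will not repair invalid HTML the way DOMDocument::loadHTML() does. For brevity it handles every tr element and does not check the class of the enclosing table.

```php
<?php
// Assumption: well-formed XML input; the sample markup is made up for the example.
$xml = '<table class="space"><tbody>'
    . '<tr><td>1</td><td><div>Mars</div></td></tr>'
    . '<tr><td>2</td><td><div>Earth</div></td></tr>'
    . '<tr><td>3</td></tr>'
    . '</tbody></table>';

$reader = new XMLReader;
$reader->XML($xml); // for a large file, $reader->open($path) reads from disk instead

$result = [];
while ($reader->read()) {
    // React only to opening <tr> tags.
    if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'tr') {
        continue;
    }

    // Materialize just this one row, not the whole document.
    $tr = simplexml_load_string($reader->readOuterXml());
    if (count($tr->td) < 2) {
        continue;
    }

    $result[] = sprintf('%d: %s', (int) $tr->td[0], trim((string) $tr->td[1]->div));
}
$reader->close();

echo implode("\n", $result), "\n";
```

Only one row is held in memory at a time, which is the point of using a streaming parser for large inputs.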
