
I'm trying to parse a table from an HTML webpage, but I'm having trouble.

Here is roughly what my HTML looks like:

<tbody>

<tr class="even">
<td class="time">Monday 20:10</td>
<td class="place">Paris 14</td>
</tr>

<tr class="odd">
<td class="time">Monday 21:00</td>
<td class="place">Paris 13</td>
</tr>

</tbody>

EDIT: Here is my PHP:

<?php

$url = 'https://www.gymsuedoise.com/loc/dt/?id=64';


$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0", // something like Firefox 
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
);

$curl = curl_init($url);
curl_setopt_array($curl, $options);
$content = curl_exec($curl);
curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($content); // @ suppresses warnings about malformed HTML

$tables = $dom->getElementsByTagName('tbody');
$rows = $tables->item(0)->getElementsByTagName('tr');

// initialise the result array and its index before the loop
$liste_element = array();
$i = 0;

foreach ($rows as $row)
{
    $cols = $row->getElementsByTagName('td');

    $liste_element[$i]['date']      = trim($cols->item(0)->nodeValue);
    $liste_element[$i]['intensite'] = trim($cols->item(2)->nodeValue);
    $liste_element[$i]['animateur'] = trim($cols->item(3)->nodeValue);
    $liste_element[$i]['forfait']   = trim($cols->item(5)->nodeValue); // always empty: this cell only holds images

    $i++;
}

echo '<pre>';
print_r($liste_element);
echo '</pre>';

?>

My issue is that my script can't scrape anything from the 6th column (i.e. item(5)) of the table, as it contains only pictures and no text. How could I scrape the content of the alt or title attribute of the <img> tag?

Sᴀᴍ Onᴇᴌᴀ
Guillaume
  • Possible duplicate of [How to parse HTML table using PHP?](http://stackoverflow.com/questions/8816194/how-to-parse-html-table-using-php) – PseudoAj Dec 05 '16 at 19:11
  • I tried the code suggested in that answer but I get a `Fatal error: Call to a member function getElementsByTagName() on a non-object`. – Guillaume Dec 05 '16 at 19:38
  • I edited my question. The issue was related to a security message from the server of the website which was considering my script as a malware. – Guillaume Dec 05 '16 at 23:14
  • To get the attribute value (alt, title or any other) use attributes property, like this: $nodeList->item(0)->attributes->getNamedItem("alt")->nodeValue; – Boy Dec 06 '16 at 09:55

2 Answers


The error that you are getting is from printing/echoing an object. Also, you can't pass a URL to loadHTML(); it expects the markup itself. You would need to do:

$fetchHtml = file_get_contents($url);
$dom->loadHTML($fetchHtml);

But doing it that way, you are likely to run into issues with the server you are trying to scrape.
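To illustrate the loadHTML() point on its own, here is a minimal sketch that parses a markup string rather than a URL. The `$markup` string is an assumption standing in for a fetched page (it mirrors the sample rows from the question), and `libxml_use_internal_errors()` is used here as an alternative to the `@` operator for silencing parser warnings.

```php
<?php
// Sample markup standing in for a downloaded page (assumed for illustration).
$markup = '<table><tbody>'
        . '<tr class="even"><td class="time">Monday 20:10</td><td class="place">Paris 14</td></tr>'
        . '<tr class="odd"><td class="time">Monday 21:00</td><td class="place">Paris 13</td></tr>'
        . '</tbody></table>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($markup); // takes the markup itself, not a URL

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Read the first cell of every row.
$times = array();
if ($loaded) {
    foreach ($dom->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        $times[] = trim($cells->item(0)->nodeValue);
    }
}

print_r($times);
```

With a real page, `$markup` would be whatever string the HTTP client returned.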

What I did was use an open-source PHP package called Guzzle.

You will need to install it in your project directory using Composer. To install Composer, just run:

curl -sS https://getcomposer.org/installer | php

Then create (or open) the composer.json file and put the following in it:

{
   "require": {
      "guzzlehttp/guzzle": "~6.0"
   }
}

Then run:

composer update

This will fetch all the dependencies you need to run Guzzle.

If you are on shared hosting, download Guzzle and upload it to your server:

github.com/guzzle/guzzle/releases

The new file will look like this:

<?php
require 'vendor/autoload.php';

$client = new GuzzleHttp\Client();
$dom = new DOMDocument();
$url = 'https://www.gymsuedoise.com/loc/dt/?id=64';

$res = $client->request('GET', $url, [
    // only needed if the site requires HTTP basic auth; otherwise remove this option
    'auth' => ['user', 'pass']
]);

$html = (string) $res->getBody();

// discard whitespace; set before loading, since it has no effect on an already-parsed document
$dom->preserveWhiteSpace = false;

// the @ in front of $dom suppresses any warnings about malformed HTML
@$dom->loadHTML($html);

// get the table bodies by tag name
$tables = $dom->getElementsByTagName('tbody');

// get all rows from the first table body
$rows = $tables->item(0)->getElementsByTagName('tr');

// loop over the table rows
foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    // echo the values
    echo $cols->item(0)->nodeValue.'<br />';
    echo $cols->item(1)->nodeValue.'<br />';
    echo $cols->item(2)->nodeValue;
}


?>

Keep in mind that this only reads the first <tbody> in the HTML, because of the item(0) call.
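If the page holds more than one table, one possible approach is to select every row with DOMXPath instead of taking item(0). The sketch below is only an illustration: the two-table `$markup` string and the `$allRows` variable are made up for the example, not taken from the site in the question.

```php
<?php
// Two small tables, assumed purely for illustration.
$markup = '<table><tbody><tr><td>A1</td><td>A2</td></tr></tbody></table>'
        . '<table><tbody><tr><td>B1</td><td>B2</td></tr></tbody></table>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// '//tbody/tr' matches the rows of every table body, not just the first one.
$allRows = array();
foreach ($xpath->query('//tbody/tr') as $row) {
    $cells = array();
    // './td' is evaluated relative to the current row.
    foreach ($xpath->query('./td', $row) as $cell) {
        $cells[] = trim($cell->nodeValue);
    }
    $allRows[] = $cells;
}

print_r($allRows);
```

Looping over `$dom->getElementsByTagName('tbody')` and reading the rows of each item would work as well; XPath just keeps the selection in one expression.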

PHPGrandMaster

One way to do this is to call DOMElement::getAttribute() on the image. To traverse down to the image, use the DOMNode::$firstChild property of the table cell and then of the anchor tag. To ensure that $firstChild won't be NULL, check DOMNode::hasChildNodes() first.

if ($cols->item(5)->hasChildNodes()) {
    $anchor = $cols->item(5)->firstChild;
    if ($anchor->hasChildNodes()) {
        $altAttribute = $anchor->firstChild->getAttribute("alt"); 
        $liste_element[$i]['forfait'] = trim($altAttribute);
    }
}

For a demonstration, see this playground example.
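The firstChild traversal assumes the image is the very first node inside the anchor, and the anchor the very first node inside the cell. If a text node (for example, whitespace) comes first, getAttribute() is not available on it. A possibly more tolerant variant is to search the cell for <img> elements directly. This is a sketch under assumptions: the row markup, the file name and the alt/title values below are invented for the example, and the fallback from alt to title is a design choice of this sketch.

```php
<?php
// Hypothetical row: the second cell contains only a linked image.
$markup = '<table><tbody><tr>'
        . '<td>Monday 20:10</td>'
        . '<td> <a href="#"> <img src="pass.png" alt="Forfait A" title="Pass A"> </a> </td>'
        . '</tr></tbody></table>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$row  = $dom->getElementsByTagName('tr')->item(0);
$cols = $row->getElementsByTagName('td');
$cell = $cols->item(1); // item(5) in the question's table

// Look for <img> anywhere inside the cell, whatever wraps it.
$images = $cell->getElementsByTagName('img');

$label = '';
if ($images->length > 0) {
    $img = $images->item(0);
    // getAttribute() returns an empty string when the attribute is missing.
    $label = trim($img->getAttribute('alt'));
    if ($label === '') {
        $label = trim($img->getAttribute('title'));
    }
}

echo $label;
```

In the question's loop, `$label` would be stored in `$liste_element[$i]['forfait']`.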

Sᴀᴍ Onᴇᴌᴀ