
I am trying to access each table row of:

http://www.alliedelec.com/search/searchresults.aspx?N=0&Ntt=PIC16F648&Ntk=Primary&i=0&sw=n

with SimpleXML->xpath. I have identified the xpath of the table to be:

'//*[@id="tblParts"]'

Now I take the page source returned by cURL, $string, and do the following:

$tidy->parseString($string);
$output = (string) $tidy;
$xml = new SimpleXMLElement($output);
$result = $xml->xpath('//*[@id="tblParts"]');
while (list( , $node) = each($result)) {
    echo 'NODE:' . $node . "\n";
}

What I get back are errors such as these, by the hundreds:

Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: Entity: line 60: parser error : Opening and ending tag mismatch: meta line 22 and head in C:\xampp\htdocs\elexess\api\driver\driver_alliedelectronics.php on line 119

Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: </head> in C:\xampp\htdocs\elexess\api\driver\driver_alliedelectronics.php on line 119

Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: ^ in C:\xampp\htdocs\elexess\api\driver\driver_alliedelectronics.php on line 119

Warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: Entity: line 108: parser error : Opening and ending tag mismatch: img line 106 and td in C:\xampp\htdocs\elexess\api\driver\driver_alliedelectronics.php on line 119

As well as this at the end:

Fatal error: Uncaught exception 'Exception' with message 'String could not be parsed as XML' in C:\xampp\htdocs\app\com\get\get_alliedelectronics.php:119 Stack trace: #0 C:\xampp\htdocs\app\com\get\get_alliedelectronics.php(119): SimpleXMLElement->__construct('<!DOCTYPE html ...') #1 C:\xampp\htdocs\app\com\get\get_alliedelectronics.php(95): get_Alliedelectronics->extractData('<!DOCTYPE html ...') #2 C:\xampp\htdocs\app\com\get\get_alliedelectronics.php(138): get_Alliedelectronics->query('PIC16F648') #3 {main} thrown in C:\xampp\htdocs\app\com\get\get_alliedelectronics.php on line 119
Shog9
Dominik

2 Answers


It looks like the HTML of the page you're fetching and trying to parse isn't well-formed (tag mismatches, etc.).

You can try to fix the errors by loading the page with DOMDocument and converting the result with simplexml_import_dom, as I explain in this SO post.
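
A minimal sketch of that approach, not a definitive implementation: the helper name xpathOnHtml and the sample markup are made up for illustration, and it assumes the fetched page is a non-empty string. DOMDocument::loadHTML() tolerates mismatched tags, and simplexml_import_dom() then hands the repaired tree to SimpleXML so the xpath() call from the question can stay as it is.

```php
<?php
// Hypothetical helper: parse possibly malformed HTML and return the
// SimpleXML nodes that match the given XPath expression.
function xpathOnHtml(string $html, string $query): array
{
    // Collect libxml warnings internally instead of printing hundreds of them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html); // lenient HTML parser, unlike new SimpleXMLElement()

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Convert the DOM tree into a SimpleXMLElement rooted at <html>.
    $xml = simplexml_import_dom($doc);
    if (!($xml instanceof SimpleXMLElement)) {
        return [];
    }

    $result = $xml->xpath($query);
    return is_array($result) ? $result : [];
}

// Made-up sample with unclosed <meta> and <img> tags, similar to the
// mismatches reported in the warnings above.
$html = '<html><head><meta charset="utf-8"></head><body>'
      . '<table id="tblParts"><tr><td><img src="a.png">PIC16F648A</td></tr></table>'
      . '</body></html>';

$nodes = xpathOnHtml($html, '//*[@id="tblParts"]');
foreach ($nodes as $node) {
    echo 'NODE: ' . $node->getName() . "\n";
}
```

Restoring the previous libxml error setting keeps the helper from changing error handling for the rest of the script.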

Community
Nev Stokes
  • Further, you need to be using tools appropriate for the data you are processing. If you plan to use XML methods, then writing good code demands that you can *guarantee* the input is well-formed, not just hope and guess by experiment. You can only trust an XML library to produce XML for you, so you have to use HTML methods to do the conversion and make the code safe if you have a 'dirty' stage earlier in your processing. – Nicholas Wilson May 08 '11 at 14:50
  • I am not sure what other tools I could use to extract data from this HTML file, nor am I sure how to clean up the dirty markup other than running it through tidy. – Dominik May 08 '11 at 14:53

I'd suggest not using SimpleXML (@Nev Stokes and @Nicholas Wilson are right: this is HTML, not XML, and you have no guarantee that it will parse as XML). Use something like DOM instead (see http://www.php.net/manual/en/book.dom.php). You can do something like:

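$doc = new DOMDocument();
// loadHTML() accepts malformed markup that the SimpleXMLElement constructor rejects
$doc->loadHTML($string);
$xpath = new DOMXPath($doc);
// query() returns a DOMNodeList of the matching elements
$entries = $xpath->query('//*[@id="tblParts"]');
foreach ($entries as $entry) {
  // do something with $entry (a DOMElement)
}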

See if that helps.
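
Since the question asks for each table row, here is a rough sketch of how the loop body might be filled in. The helper name tableRows and the sample markup are hypothetical, and the table id is assumed to be a trusted value because it is concatenated straight into the XPath expression.

```php
<?php
// Hypothetical helper: return the trimmed cell text of every row in the
// table with the given id, one array of strings per row.
function tableRows(string $html, string $tableId): array
{
    // Keep parser warnings about sloppy markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];

    // "//tr" matches rows whether or not the table contains a <tbody>.
    foreach ($xpath->query('//*[@id="' . $tableId . '"]//tr') as $tr) {
        $cells = [];
        // Evaluate relative to the current row to get its header/data cells.
        foreach ($xpath->query('./td|./th', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $rows[] = $cells;
    }

    return $rows;
}

// Made-up sample; the unclosed <img> is the kind of tag that breaks an XML parser.
$sample = '<table id="tblParts">'
        . '<tr><th>Part</th><th>Stock</th></tr>'
        . '<tr><td><img src="x.png">PIC16F648A-I/P</td><td>12</td></tr>'
        . '</table>';

$rows = tableRows($sample, 'tblParts');
foreach ($rows as $row) {
    echo implode(' | ', $row), "\n";
}
```

Returning plain arrays keeps the DOM objects inside the helper, so the calling code only deals with strings.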

Femi