Converting HTML to URL scraper

Question

A very helpful guy on Stack Overflow has helped me get this far. However, I need to convert his code so that it scrapes a URL instead of an HTML string. I've tried over and over and keep hitting errors. Any ideas?

function getElementByIdAsString($html, $id, $pretty = true) {
    $doc = new DOMDocument();

    // loadHTML() returns false on failure; the @ hides parser warnings
    if (!@$doc->loadHTML($html)) {
        throw new Exception("Failed to load HTML");
    }

    $element = $doc->getElementById($id);
    if (!$element) {
        throw new Exception("An element with id $id was not found");
    }

    // get all object tags
    $objects = $element->getElementsByTagName('object'); // returns a node list

    // take the value of the data attribute from the first object tag
    $data = $objects->item(0)->getAttributeNode('data')->value;

    // cut away the unnecessary parts and return the info
    return substr($data, strpos($data, '=') + 1);
}

// call it:
$finalcontent = getElementByIdAsString($html, 'mainclass');

print_r ($finalcontent);

It just blanks out. Is there a better way for me to get the errors? I'm new to all this. — Jamie, Nov 19 '15 at 17:13
I'm simply trying to pass in a URL to scrape rather than the $html example the guy gave on Stack Overflow. — Jamie, Nov 19 '15 at 17:13
First, remove the `@` as this will silence errors (avoid using it, really). Then add `error_reporting(E_ALL);` to report all errors. — camelCase, Nov 19 '15 at 17:15
The only error I'm getting is in the Chrome console: "Failed to load resource: the server responded with a status of 500 (Internal Server Error)". It's not loading my WordPress footer, so I assume it's just causing errors during the scrape. — Jamie, Nov 19 '15 at 17:17
`500` can be many things. If you remove this function from the page, does it load properly? Essentially, you need to sort out where the error is located then you can sort how to solve it. — camelCase, Nov 19 '15 at 17:19
I had a session of removing chunks of the code; it's the final part, where it tries to call the content, that is bugging out. — Jamie, Nov 19 '15 at 17:22
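
A blank page or a bare 500 usually means PHP's errors are being hidden. As a rough sketch of camelCase's suggestion, lines like the following could go at the top of the script while debugging. The `display_errors` setting is an addition that is not mentioned in the thread, and it should be switched off again on a live site.

```php
<?php
// Report every error, warning and notice while debugging
error_reporting(E_ALL);

// Assumption (not from the thread): also print errors to the page
// instead of only writing them to the server log
ini_set('display_errors', '1');
```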

score 1 · Accepted Answer · answered Nov 19 '15 at 17:26

Remember to wrap the call to your function in try/catch. The function is likely to throw exceptions, and an uncaught exception will cause a 500 Internal Server Error.

$finalcontent = getElementByIdAsString($html, 'mainclass');

should become

try {
    $finalcontent = getElementByIdAsString($html, 'mainclass');
} catch (Exception $e) {
    echo $e->getMessage();
}

Elijah


Thank you so much, this has removed the error! Now for the main problem: I need this to scrape from a URL. How can I convert this chunk of code to read a URL rather than the $html it currently uses? – Jamie Nov 19 '15 at 17:31
Depending on your hosting, you should be able to call `$html = file_get_contents($url);`, which takes the URL you provide and tries to fetch the HTML of that document. If that doesn't work, you will probably have to look into cURL and fetch the HTML of the page that way! – Elijah Nov 19 '15 at 17:33
I assume, from the fact it's now white-screened, that this won't work with WordPress on a custom Linode? – Jamie Nov 19 '15 at 17:36
Weird. I removed the whole script and simply put in `$html = file_get_contents('http://www.url.com')` and echoed it out, and it worked fine. With the whole function, however, it causes an error. – Jamie Nov 19 '15 at 17:38
If you're hosting the website yourself on a Linux machine, or have access to php.ini, you can let `file_get_contents` read URLs by adding or changing `allow_url_fopen = On`. If not, you should be able to use cURL [(cURL example)](http://stackoverflow.com/questions/3592270/php-get-html-source-code-with-curl) – Elijah Nov 19 '15 at 17:39
If the `try/catch` block removed the `500` error from before, it means there is an error in your function. It didn't "remove" the error, it just confirmed the error is present. You need to review what `$e->getMessage()` has echo'd. – camelCase Nov 19 '15 at 17:41
I HAVE GOT THE OUTPUT! – Jamie Nov 19 '15 at 17:42
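
Pulling the comments above together, a minimal sketch of the URL-fetching step might look like this. It assumes `allow_url_fopen` is on for the `file_get_contents` path and that the cURL extension is installed for the fallback. `fetchHtml` is a hypothetical helper name and the 10-second timeout is an arbitrary choice; neither comes from the thread.

```php
<?php
// Hypothetical helper (not from the original thread): fetch a page's HTML,
// trying file_get_contents() first and falling back to cURL.
function fetchHtml($url) {
    // file_get_contents() can only read URLs when allow_url_fopen = On
    if (ini_get('allow_url_fopen')) {
        $html = @file_get_contents($url);
        if ($html !== false) {
            return $html;
        }
    }

    // Fallback: cURL, if the extension is available
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // assumed timeout, in seconds
        $html = curl_exec($ch);
        if ($html !== false) {
            return $html;
        }
    }

    // Neither method worked; let the caller's try/catch report it
    throw new Exception("Failed to load $url");
}
```

The result would then stand in for `$html`, for example `$html = fetchHtml('http://www.url.com');` placed inside the existing `try` block just before the call to `getElementByIdAsString($html, 'mainclass')`, so that a failed download is reported through the same `catch`.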
