2

I was just about to attempt scraping using the Simple HTML DOM framework (http://simplehtmldom.sourceforge.net/), but it turns out that opening URLs with file_get_contents is disabled in the server configuration for security reasons.

I now need to find a similar framework that uses cURL. Does anybody know of one?

The error message I get when trying to run the Slashdot example is:

Warning: file_get_contents() [function.file-get-contents]: URL file-access is disabled in the server configuration in /var/www/vhosts/domain.com/httpdocs/crawlfeed/simple_html_dom.php on line 70

martincarlin87
  • possible duplicate of [How to parse and process HTML with PHP?](http://stackoverflow.com/questions/3577641/how-to-parse-and-process-html-with-php) – mario Jan 13 '12 at 16:02
  • Can't you just cURL the file and then load the text string into SimpleHTMLDOM? – prodigitalson Jan 13 '12 at 16:02
  • you don't HAVE to use file_get_contents with simplehtml. You can fetch the html yourself with curl and feed the results to simplehtml directly. – Marc B Jan 13 '12 at 16:02
  • Also you could really just do the curl request separately, and pass in the string. `$dom = str_get_html(curl($url)->returntransfer(1)->exec());` – mario Jan 13 '12 at 16:05

3 Answers

6

Just pull the page down with cURL, then load the string into SimpleHTMLDOM:

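// Assumes simple_html_dom.php is in the same directory as this script
include_once 'simple_html_dom.php';

$ch = curl_init('http://theurl.com');
// Return the response body as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$htmlStr = curl_exec($ch);
curl_close($ch);

$html = new simple_html_dom();

// Load HTML from a string
$html->load($htmlStr);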
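
If the request can fail, it may be worth wrapping the cURL calls in a small helper that reports failure instead of handing an empty or false value to the parser. The following is only a sketch: `fetch_html` and its timeout parameter are hypothetical names introduced for illustration, not part of Simple HTML DOM or cURL.

```php
<?php
/**
 * Fetch a URL with cURL and return the response body, or null on failure.
 * fetch_html() is a hypothetical helper, not a library function.
 */
function fetch_html($url, $timeoutSeconds = 10)
{
    $ch = curl_init($url);
    // Return the body as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Give up rather than hang on a slow or unreachable host
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // curl_exec() returns false on transport errors; also treat HTTP 4xx/5xx as failure
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

The returned string can then be passed to `$html->load()` as above, after checking that it is not null.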
prodigitalson
4

If you have PHP 5.3 (you should, as PHP 5.2 isn't supported anymore), I totally recommend Goutte.

It's fairly new, and it's just a .phar to include in your project. The HTTP part is handled by Zend Http and a socket, and you get the powerful BrowserKit and DomCrawler Symfony components to help you extract information from HTML (no regex, no XPath).

Damien
1

Just use cURL to get the HTML and then parse it using XPath or regular expressions. XPath is a good choice because it is a query language designed for selecting nodes in XML or (X)HTML documents, which is exactly what you want here.

There is a good example here: http://www.2basetechnologies.com/screen-scraping-with-xpath-in-php
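
As a rough illustration of the XPath approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, the `story` class name, and the variable names are made up for the example; in practice `$htmlStr` would be the string returned by cURL.

```php
<?php
// Hypothetical sample markup standing in for a page fetched with cURL
$htmlStr = '<html><body>
  <h1>Stories</h1>
  <a class="story" href="/story/1">First story</a>
  <a class="story" href="/story/2">Second story</a>
  <a href="/about">About</a>
</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlStr);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every <a> element whose class attribute is exactly "story"
$links = array();
foreach ($xpath->query('//a[@class="story"]') as $a) {
    $links[$a->getAttribute('href')] = trim($a->textContent);
}

print_r($links);
```

Note that `@class="story"` only matches when the attribute value is exactly `story`; elements carrying several classes would need a `contains()` expression instead.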

Nidhin Baby
Daniel West