
I'm trying to write my first crawler using PHP and the cURL library. My aim is to fetch data from one site systematically, which means the code doesn't follow every hyperlink on the site but only specific links.

The logic of my code is to go to the main page, get the links for several categories and store them in an array. The crawler then visits those category pages and checks whether a category has more than one page. If so, it stores the sub-pages in another array as well. Finally, I merge the arrays to get the links of all the pages that need to be crawled and start fetching the required data.
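In rough outline the flow looks like this (getCategoryLinks() and getSubPageLinks() here are just placeholders for the XPath extraction, and get_url() is the cURL helper shown further down):

// Rough outline only: getCategoryLinks() and getSubPageLinks() are
// placeholders for the XPath-based extraction, get_url() is defined below.
$aProdCat = getCategoryLinks(get_url($site2crawl));   // category links from the main page

$aSubPages = array();
foreach ($aProdCat as $ProdCatPage) {
    // collect extra pages if the category is paginated
    $aSubPages = array_merge($aSubPages, getSubPageLinks(get_url($site2crawl . $ProdCatPage)));
}

// every page that has to be crawled for the actual data
$aPages2Crawl = array_merge($aProdCat, $aSubPages);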

I call the function below to start a cURL session and fetch the data into a variable, which I later pass to a DOM object and parse with XPath. I store cURL's total_time and http_code in a log file.

The problem is that the crawler runs for 5-6 minutes, then stops and doesn't fetch all the required sub-page links. I print the contents of the arrays to check the result. I can't see any HTTP error in my log; all pages return an HTTP 200 status code. I also can't see any PHP-related error, even with PHP debugging turned on on my localhost.

I assume the site blocks my crawler after a few minutes because of too many requests, but I'm not sure. Is there any way to get more detailed debug output? And do you think PHP is adequate for this type of task? I want to use the same mechanism later on to fetch content from more than 100 other sites.

My cURL code is as follows:

function get_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);         // body only, no response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $info = curl_getinfo($ch);

    // Log the fetch time and HTTP status code of every request.
    $logfile = fopen("crawler.log", "a");
    fwrite($logfile, 'Page ' . $info['url'] . ' fetched in ' . $info['total_time'] . ' seconds. Http status code: ' . $info['http_code'] . "\n");
    fclose($logfile);
    curl_close($ch);

    return $data;
}

// Start by crawling the main page.

$site2crawl = 'http://www.site.com/';

$dom = new DOMDocument();
@$dom->loadHTML(get_url($site2crawl));
$xpath = new DomXpath($dom);
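
The category links are then collected with an XPath query along these lines (the expression itself is only an illustration; the real one depends on the site's markup):

// Illustration only: the actual XPath expression depends on the site's markup.
$aProdCat = array();
foreach ($xpath->query("//a[contains(@class, 'category')]/@href") as $hrefNode) {
    $aProdCat[] = $hrefNode->nodeValue;
}
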
  • I found this line in my LAMPP error_log: [:error] [pid 2996] [client 127.0.0.1:49848] PHP Fatal error: Maximum execution time of 30 seconds exceeded in /opt/lampp/htdocs/clw/clw.php on line 73. I'll try to increase the timeout for cURL and retry. – g0m3z Dec 31 '12 at 19:58
  • I increased the timeout parameter, then changed it to zero, but it did not help. – g0m3z Dec 31 '12 at 20:11
  • Have you seen if cURL is getting any errors? Something like this should work: `if ($data == false) { fwrite($logfile, curl_error($ch)); }` – Chris Ostmo Dec 31 '12 at 20:42
  • By 'increase timeout for cURL' do you mean you used [set_time_limit](http://php.net/manual/en/function.set-time-limit.php)? – Quentin Skousen Dec 31 '12 at 20:47
  • @kkhugs: I set the CURLOPT_CONNECTTIMEOUT parameter to zero but it did not help. – g0m3z Dec 31 '12 at 21:08
  • @Gomez: Try using `set_time_limit(0);`. Even with `CURLOPT_CONNECTTIMEOUT` set, your PHP script will still time out. – Quentin Skousen Dec 31 '12 at 21:10
  • To clarify further: `CURLOPT_CONNECTTIMEOUT` is used to set the amount of time cURL will wait for a page to load before it times out. `set_time_limit` is used to set the amount of time your PHP script itself can run before assuming it's stuck in an endless loop and killing itself. – Quentin Skousen Dec 31 '12 at 21:16
  • @ChrisOstmo: I did not try this, but I have probably found the problem already. Later in my code I create a new DOM object and a new XPath object within a foreach loop more than 1000 times. This probably causes a memory leak. I found this post about the same issue: http://stackoverflow.com/questions/8379829/domdocument-php-memory-leak/8379947#8379947 . But I don't know how I can implement it in my code for a foreach loop like this (see the sketch after this comment thread): 'foreach (array_slice($aProdCat,1) as $ProdCatPage){ $domCat = new DOMDocument(); @$domCat->loadHTML(get_url($site2crawl.$ProdCatPage)); $xpathCat = new DomXpath($domCat);}' – g0m3z Dec 31 '12 at 21:16
  • @kkhugs: Thanks for this! I think I found the problem. Please see my post above. It's a memory leak issue since my code implements new DOM object within a foreach loop more than 1000 times. I'm looking for a solution for this issue now. Thanks once again! – g0m3z Dec 31 '12 at 21:35
  • Thanks to **kkhugs**, who suggested setting the time limit to zero within the code. It helped. The following code solved my issue: `set_time_limit(0);` I also implemented the code which can be found here to avoid the memory leak issue. Thread can be closed. Thanks, everyone! gomez – g0m3z Dec 31 '12 at 23:25
  • @gomez: I've submitted my suggestion as an answer. Please accept it so this question can be considered answered. Questions here are only closed in extreme cases. Happy new year! – Quentin Skousen Jan 01 '13 at 12:14
  • There are lots of really good, stable, effective and efficient spiders available open source, so why write another one? To list just one: https://github.com/scrapy/scrapy – Toby Allen Jan 01 '13 at 12:15
  • Thanks @TobyAllen! I'll certainly take a look at it. – g0m3z Jan 01 '13 at 14:11
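
For reference, a minimal sketch of the clean-up discussed in the comments above, reusing the loop quoted by g0m3z. This is a general approach to keeping memory in check, not necessarily the exact code from the linked answer: collect libxml parser errors internally instead of suppressing them with @, clear them each iteration, and release the DOM/XPath objects before the next page is loaded.

// Collect libxml parser errors internally (instead of suppressing with @).
libxml_use_internal_errors(true);

foreach (array_slice($aProdCat, 1) as $ProdCatPage) {
    $domCat = new DOMDocument();
    $domCat->loadHTML(get_url($site2crawl . $ProdCatPage));
    $xpathCat = new DomXpath($domCat);

    // ... run the XPath queries for this category page here ...

    libxml_clear_errors();      // drop accumulated parser warnings
    unset($xpathCat, $domCat);  // free the DOM before the next iteration
}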

2 Answers


Use `set_time_limit` to extend the amount of time your script is allowed to run. The default 30-second limit is why you are getting `Fatal error: Maximum execution time of 30 seconds exceeded` in your error log.
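For example, at the top of the crawler script (0 means no limit):

// Lift the execution time limit for this script (0 = run as long as needed).
set_time_limit(0);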

Quentin Skousen

Do you need to run this on a server? If not, you should try the CLI version of PHP; it is exempt from common restrictions (for example, `max_execution_time` defaults to 0 there).
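If you want to check which environment you are running in and what limit currently applies, a quick sketch:

// Under the CLI SAPI, max_execution_time defaults to 0 (no limit).
echo 'SAPI: ' . PHP_SAPI . "\n";
echo 'max_execution_time: ' . ini_get('max_execution_time') . "\n";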

user1938139
  • Yes, I would like to run it on a server later on in production. – g0m3z Dec 31 '12 at 21:45
  • Why would you not be able to run the CLI version on a server? – Toby Allen Jan 01 '13 at 12:15
  • Thanks @TobyAllen, my issue was already solved. I'll have enough time later to figure out how I will implement this in production. I'm going to improve my crawler code first (with parallel threads, for example). – g0m3z Jan 01 '13 at 13:54