I'm trying to write my first crawler using PHP with the cURL library. My aim is to fetch data from one site systematically, meaning the code doesn't follow every hyperlink on the site, only specific ones.
The logic of my code is to go to the main page, get the links for several categories, and store them in an array. The crawler then visits each category page and checks whether the category has more than one page; if so, it stores the subpage links in another array. Finally, I merge the two arrays to get all the links for the pages that need to be crawled, and start fetching the required data.
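To make the flow concrete, here is a minimal sketch of that two-pass collection. The XPath queries (`//a[@class='category']`, `//a[@class='page']`) are placeholders I made up; the real expressions depend on the target site's markup, and `get_url()` is the cURL helper shown further down.

```php
<?php
// Extract the href attributes of all links matching an XPath query.
function extract_links($html, $query)
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);
    $links = array();
    foreach ($xpath->query($query) as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}

// Pass 1: category links from the main page (placeholder query).
// $categories = extract_links(get_url($site2crawl), "//a[@class='category']");
//
// Pass 2: pagination links inside each category, then merge both lists
// and drop duplicates before the scraping pass.
// $subpages = array();
// foreach ($categories as $category) {
//     $subpages = array_merge($subpages, extract_links(get_url($category), "//a[@class='page']"));
// }
// $pages_to_crawl = array_unique(array_merge($categories, $subpages));
```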
I call the function below to start a cURL session and fetch the data into a variable, which I later load into a DOM object and parse with XPath. I store cURL's total_time and http_code in a log file.
The problem is that the crawler runs for 5-6 minutes, then stops without having fetched all the required subpage links. I print the contents of the arrays to check the result. I can't see any HTTP errors in my log; all pages return an HTTP 200 status code. I also can't see any PHP-related errors, even with PHP debugging turned on on my localhost.
I assume the site blocks my crawler after a few minutes because of too many requests, but I'm not sure. Is there any way to get more detailed debugging output? And do you think PHP is adequate for this kind of task? I want to use the same mechanism to fetch content from more than 100 other sites later on.
My cURL code is as follows:
function get_url($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
    curl_setopt($ch, CURLOPT_URL, $url);
    $data = curl_exec($ch);
    $info = curl_getinfo($ch);
    $logfile = fopen("crawler.log", "a");
    if ($data === false) {
        // Transport-level failures (timeouts, resets) never produce an
        // HTTP status code, so log the cURL error message instead.
        fwrite($logfile, 'Page ' . $url . ' failed: ' . curl_error($ch) . "\n");
    } else {
        fwrite($logfile, 'Page ' . $info['url'] . ' fetched in ' . $info['total_time'] . ' seconds. Http status code: ' . $info['http_code'] . "\n");
    }
    fclose($logfile);
    curl_close($ch);
    return $data;
}
// Start by crawling the main page.
$site2crawl = 'http://www.site.com/';
$dom = new DOMDocument();
@$dom->loadHTML(get_url($site2crawl));
$xpath = new DOMXPath($dom);
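For completeness, this is how I then pull the category links out of the `$xpath` object. The snippet below is self-contained for illustration, using made-up sample markup and a hypothetical `class="category"` selector in place of the real site's HTML:

```php
<?php
// Illustrative only: build a DOMXPath from sample markup, then collect
// category hrefs the same way the crawler does after loadHTML().
// The 'category' class name is an assumption about the site's markup.
$sample = '<html><body><a class="category" href="/books">Books</a>'
        . '<a class="category" href="/music">Music</a></body></html>';
$dom = new DOMDocument();
@$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

$category_links = array();
foreach ($xpath->query("//a[@class='category']") as $node) {
    $category_links[] = $node->getAttribute('href');
}
// $category_links is now array('/books', '/music')
```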