I have a server app which will run some long-running PHP scripts in the background via CLI. One of these is a simple spider which will go through a list of websites and fetch their content using cURL.
When the function that does the work is part of a page accessed by the browser, it works fine. When I punt the work to a PHP script running in the CLI, sites behind Cloudflare fail with a "Please enable cookies." page, followed by details saying that I am blocked.
This is the PHP function:
static function getPage($url, $timeout = 5)
{
    $agent = 'Mozilla/5.0 (compatible; SimpleSpiderBot/0.1; +'.url('/').')';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $html = @curl_exec($ch);
    curl_close($ch);
    return $html;
}
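For context, here is a variant I have been experimenting with. Cloudflare's challenge looks at more than the user agent: it expects cookies to persist across requests and typical browser headers to be present (and may also fingerprint the TLS stack, which no curl option controls). This is only a sketch; the cookie-jar path and header values are my own illustrative choices, not a known fix:

```php
<?php
// Sketch: same fetch, but with a persistent cookie jar and typical
// browser headers. The jar path and header values are assumptions;
// Cloudflare may still block based on TLS fingerprinting.
function getPageWithCookies(string $url, int $timeout = 5)
{
    $jar = sys_get_temp_dir() . '/spider_cookies.txt'; // illustrative path
    $ch  = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,         // follow challenge redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,
        CURLOPT_TIMEOUT        => $timeout * 4, // cap the whole transfer too
        CURLOPT_COOKIEFILE     => $jar,         // read cookies from prior runs
        CURLOPT_COOKIEJAR      => $jar,         // persist cookies on close
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; SimpleSpiderBot/0.1)',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: en-US,en;q=0.9',
        ],
    ]);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
```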
What confuses me is that the PHP doing the work is identical; only the environment (CLI vs. Apache request) differs. I tried pointing the PHP CLI at the same php.ini file the page uses, which didn't work.
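To verify the two environments really match, I ran this quick diagnostic once via `php` on the command line and once through Apache, then diffed the output (all three functions are standard PHP; the script name is arbitrary):

```php
<?php
// Diagnostic: print which SAPI, php.ini, and cURL/SSL build this
// environment uses. Differences here are a common reason the same
// code behaves differently between CLI and Apache.
echo 'SAPI: ', php_sapi_name(), PHP_EOL;
echo 'php.ini: ', php_ini_loaded_file(), PHP_EOL;
$v = curl_version();
echo 'cURL: ', $v['version'], ' (SSL: ', $v['ssl_version'], ')', PHP_EOL;
```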
Edit: I had added cookie-handling code, but when that failed to solve the problem I removed the excess code for clarity.