
I have a server app which will run some long-running PHP scripts in the background via CLI. One of these is a simple spider which will go through a list of websites and fetch their content using cURL.

When the function that does the work is part of a page accessed through the browser, it works fine. When I move the work to a PHP script running from the CLI, sites behind Cloudflare fail with the message "Please enable cookies." followed by details explaining that I am blocked.

This is the PHP function:

static function getPage($url, $timeout = 5)
{
    // url() is the application's own helper for building a site URL.
    $agent = 'Mozilla/5.0 (compatible; SimpleSpiderBot/0.1; +'.url('/').')';
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skips TLS certificate verification
    curl_setopt($ch, CURLOPT_VERBOSE, true);         // writes connection details to stderr
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);

    $html = @curl_exec($ch);
    curl_close($ch);
    return $html;
}

What confuses me is that the PHP code doing the work is identical in both cases; only the PHP environment (CLI vs. Apache request) differs. I tried making the PHP CLI command use the same php.ini file as the page, but that didn't help.

Edit: Cookie handling code was added, but when that failed to solve the problem I removed the excess code for clarity.

Kver
  • Try making a packet capture between your server and the failing sites, and compare the HTTP headers in the two cases. – Barmar Jan 25 '19 at 21:20
  • What do the CURLOPT_VERBOSE logs say? Also, what IP does Apache use and what IP does the CLI use? The same IP? – hanshenrik Jan 25 '19 at 21:36
  • @Barmar Wireshark is swamping me with data pretty badly, is there a nice Linux novice-packet-sniffer app you would recommend? I don't usually work on a packet level. – Kver Jan 25 '19 at 21:41
  • @hanshenrik Should be the same IP. I'll turn on verbose output and see what it says. – Kver Jan 25 '19 at 21:41
  • Wireshark is my preference for graphical sniffer, I also use tcpdump from CLI. You should be able to filter it to just the connection between your server and a specific website, to reduce the swamping. – Barmar Jan 25 '19 at 21:43
  • May have just solved it. I'm going to do some testing, if it all works out I'll explain how I might be an idiot. – Kver Jan 25 '19 at 21:51
  • Any update? Who's the idiot, you or the Cloudflare people? – Barmar Jan 26 '19 at 00:39
  • Both IMHO. Cloudflare has a 'safety and security' mechanism which will block out 'malformed requests'. When running my curl function in CLI mode, the url('/') call would produce "/", whereas when run from the browser it would produce something like "example.com/". Cloudflare saw that and decided the lone slash was a hazard to the entire request, blocking it. After removing the URL from the agent string entirely it worked fine. I was an idiot for not paying attention to what my functions were doing; Cloudflare is also moronic for blocking a request over the agent having a lone slash in it. – Kver Jan 28 '19 at 18:50
  • If you create a curl request for **/**, then that has nothing to do with Cloudflare. That'll be dependent on your computer's local settings. Running `curl -v /` on my workstation makes no request to any domain, and just returns "<url> malformed". A single slash is not a valid URL. And you could hardly blame Cloudflare for wanting to protect its servers against malformed URLs. – S. Imp Jan 28 '19 at 19:04

2 Answers


The issue is that Cloudflare attempts to validate several aspects of the request, but it doesn't necessarily say which part it considers "malformed". In this case, the url() function I wrote returned "/" when running in the background, as opposed to a full URL such as "example.com/" as it would in the browser. This meant the user agent read "Mozilla/5.0 (compatible; SimpleSpiderBot/0.1; /)", which Cloudflare apparently didn't like.
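
One way to guard against this is to only include the contact URL in the user agent when it is actually absolute. The following is a minimal sketch, not the code I ended up using: `buildUserAgent()` is a hypothetical helper, and it assumes the base URL carries a scheme (for example `https://example.com/`), so a bare "/" falls back to an agent string without a URL.

```php
<?php
// Hypothetical helper: builds the spider's user agent string.
// $baseUrl is whatever the application's url('/') helper returned.
function buildUserAgent(string $baseUrl, string $bot = 'SimpleSpiderBot/0.1'): string
{
    // Only advertise a contact URL when it has both a scheme and a host.
    // In CLI the helper returned "/", which parses to a path only.
    $parts = parse_url($baseUrl);
    $isAbsolute = is_array($parts) && isset($parts['scheme'], $parts['host']);

    return $isAbsolute
        ? "Mozilla/5.0 (compatible; {$bot}; +{$baseUrl})"
        : "Mozilla/5.0 (compatible; {$bot})";
}
```

With this sketch, `buildUserAgent('/')` gives "Mozilla/5.0 (compatible; SimpleSpiderBot/0.1)" rather than an agent ending in a lone slash.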

My advice to developers stumbling onto this question is to check every header and option thoroughly to see whether Cloudflare might be getting "nitpicky" about the content, as it seems even a slight "malformation" can get a request blocked.
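
To check the headers, it may help to have cURL hand back the request exactly as it was sent, then compare the CLI output with the output from a browser-triggered run. This is a rough sketch assuming the curl extension is loaded; `headerDebugOptions()` and `sentRequestHeaders()` are hypothetical names, not part of my spider.

```php
<?php
// Hypothetical helper: cURL options that record the outgoing request headers.
function headerDebugOptions(string $url, string $agent): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => $agent,
        CURLOPT_RETURNTRANSFER => true,
        CURLINFO_HEADER_OUT    => true, // keep a copy of the request as sent
    ];
}

// Hypothetical helper: performs the request and returns the headers that went out.
function sentRequestHeaders(string $url, string $agent): string
{
    $ch = curl_init();
    curl_setopt_array($ch, headerDebugOptions($url, $agent));
    curl_exec($ch);

    // curl_getinfo() yields false when nothing was recorded
    // (for example, the connection failed), so cast to an empty string.
    return (string) curl_getinfo($ch, CURLINFO_HEADER_OUT);
}
```

Logging the result of `sentRequestHeaders($url, $agent)` from both environments would have shown the differing User-Agent line straight away.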

Kver
  • As I mentioned in the comments above, a single slash is not a valid url. This has nothing to do with cloudflare. – S. Imp Jan 28 '19 at 19:06
  • Where did he say that `$url` is a single slash? He's talking about `url('/')` that's used in setting `$agent`. – Barmar Jan 28 '19 at 21:14

I cannot think of any reason why your web server would succeed but CLI fail when it comes to cookies. According to haxx.se, which I believe is the official site for curl, curl does not handle cookies unless you explicitly tell it to. I believe your script above will not bother to handle cookies at all by default. That you get any correct behavior at all if the site demands cookies suggests that you've overlooked something or that your problem lies somewhere else.

Note that you can set up your curl request to accept cookies as described here.
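
For reference, a minimal sketch of what cookie handling looks like with PHP's curl extension. `cookieOptions()` is a hypothetical helper name and the jar path is an assumption; as the comments here note, enabling cookies did not turn out to fix the asker's problem.

```php
<?php
// Hypothetical helper: options that make cURL store and resend cookies.
function cookieOptions(string $jarPath): array
{
    return [
        // Read cookies from this file; setting it also enables cookie handling.
        CURLOPT_COOKIEFILE => $jarPath,
        // Write the cookies received back to this file when the handle is released.
        CURLOPT_COOKIEJAR  => $jarPath,
    ];
}
```

These would be applied to an existing handle with something like `curl_setopt_array($ch, cookieOptions(sys_get_temp_dir() . '/spider-cookies.txt'));` before calling `curl_exec()`.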

S. Imp
  • I did at one point do just that; it had no effect, so I stripped out the excess code again before posting. – Kver Jan 25 '19 at 21:42
  • If the friendly down-voter would explain the reason for the down vote I'll happily modify or delete my answer. If you'd like to make sure your curl script handles cookies, the question has already been asked before. – S. Imp Jan 25 '19 at 22:19
  • @Kver I have edited my response to provide additional detail. – S. Imp Jan 25 '19 at 22:38
  • I wasn't the one who downvoted it, the only reason I didn't upvote was because I tried cookies first when it told me cookies were the problem; I should have mentioned that in the OP though. – Kver Jan 28 '19 at 18:55