How to get a CSV from a site with a 503 error with CloudFlare protection?

Question

I have created a PHP script which gets a CSV from an external site using fopen and fgetcsv to store the data into an array.

The external site sporadically throws 503 errors. When this occurs fopen will not work and returns an error that the website is unavailable.

The external site in question continues to work fine in a browser, as it is protected by CloudFlare.

Is there any way to still get the CSV in this scenario? I imagine by somehow mimicking a browser in my script to get the file...? May not be possible but need confirmation.

LSerni · Answer 1 · 2017-12-10T11:55:35.357

There is no way to bypass the CloudFlare protection by spoofing the User-Agent or the like: if that were possible, CloudFlare wouldn't provide any security at all.

What is probably happening is one of two things: either the backend has failed but CloudFlare lets the browser use a cached response, or the failure is intermittent and the browser still works because its request happens to be the next call. It might well happen that your CSV scraper succeeds and the browser fails, and you do not know because when the scraper succeeds... you don't check with the browser at all, as you've no reason to.

As for what you can do: yes, you can emulate a human being with a browser. You do this by caching any successful response together with a timestamp, and by retrying after a short pause when you get an error.

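function scrapeCSV($retries = 3) {
    if (0 === $retries) {
        // out of attempts: return an invalid response to signify an error
        return null;
    }
    // @ suppresses the warning fopen raises when the site answers with a 503
    $fp = @fopen(...);
    if (!$fp) {
        // failed: pause briefly, then try again with one attempt fewer
        sleep(1);
        return scrapeCSV($retries - 1);
    }
    ...
    return $csv;
}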
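A minimal sketch of how the retry and the timestamped cache could fit together. The function name `fetchCsvWithCache`, the `$fetch` callable and the cache file path are hypothetical, not part of the original script; `$fetch` stands for whatever code actually downloads the CSV and is assumed to return the body as a string, or null on failure. The cache file's modification time serves as the timestamp.

```php
<?php
// Hypothetical helper: retry a download a few times, cache every success,
// and fall back to the last cached copy when all attempts fail.
// $fetch is assumed to return the CSV body as a string, or null on failure.
function fetchCsvWithCache(callable $fetch, string $cacheFile, int $retries = 3, int $pauseSeconds = 1): ?array
{
    for ($attempt = 1; $attempt <= $retries; $attempt++) {
        $body = $fetch();
        if (is_string($body) && $body !== '') {
            // Success: remember the body; the file's mtime is the timestamp.
            file_put_contents($cacheFile, $body, LOCK_EX);
            return ['body' => $body, 'fetched_at' => time(), 'stale' => false];
        }
        if ($attempt < $retries) {
            sleep($pauseSeconds); // short pause before the next attempt
        }
    }

    // Every attempt failed: serve the cached copy, flagged as stale.
    if (is_readable($cacheFile)) {
        clearstatcache(true, $cacheFile);
        return [
            'body'       => file_get_contents($cacheFile),
            'fetched_at' => filemtime($cacheFile),
            'stale'      => true,
        ];
    }

    return null; // no fresh data and nothing cached
}
```

The caller can look at `stale` and `fetched_at` to decide whether a cached copy is still recent enough to use.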

UPDATE

To access the second-level cache "as a browser would do" you probably need to cross-breed two different solutions: how to "fake" a browser connection and how to read from curl as if it was a stream (i.e. fopen).

If you're cool with recovering the whole CSV in one fell swoop and parsing it later, once you've got it as a local file, then you only need the first answer (there's a more upvoted, more detailed and procedural answer below the one I linked - the one I linked is mine ;-) ).
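If the whole CSV has already been retrieved as a string (by curl or otherwise), one way to keep using `fgetcsv` is to put the body into an in-memory stream first. This is only a sketch: `parseCsvString` is a hypothetical name, and the comma delimiter and double-quote enclosure are assumptions about the file.

```php
<?php
// Hypothetical helper: parse a CSV body held in a string by writing it to
// a temporary stream, so that fgetcsv can read it like an fopen handle.
function parseCsvString(string $body): array
{
    $fp = fopen('php://temp', 'r+');
    fwrite($fp, $body);
    rewind($fp); // go back to the start before reading

    $rows = [];
    // Delimiter, enclosure and escape are passed explicitly (assumed values).
    while (($row = fgetcsv($fp, 0, ',', '"', '\\')) !== false) {
        if ($row === [null]) {
            continue; // fgetcsv reports a blank line as [null]
        }
        $rows[] = $row;
    }
    fclose($fp);

    return $rows;
}
```

`php://temp` keeps small payloads in memory and spills to a temporary file for larger ones, so the same code should work for a large CSV.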

When I say sporadically, I mean the external site can be down for hours. If you access the external site by browser Cloud Flare even tells you it is checking your browser and redirecting you to their version. — bigdaveygeorge, Dec 10 '17 at 11:41
Okay, I'm afraid that caching it is, then. Or you can try and leverage CloudFlare's second-level cache by using `curl` and setting the option to follow redirects. You probably still need to verify the data you get make sense. — LSerni, Dec 10 '17 at 11:45
If I try to curl I get redirected to this 404 address: http://localhost/cdn-cgi/l/chk_jschl?jschl_vc=b5eb05aeaa403ee52ead48afaa2179f2&pass=1512906506.179-N%2Fgi2i3HMp&jschl_answer=12483 — bigdaveygeorge, Dec 10 '17 at 11:49
I added a couple of pointers that *ought* to see you through. Let me know how it pans out. — LSerni, Dec 10 '17 at 11:56

score 0 · Answer 2 · answered Dec 10 '17 at 11:37

Cloudflare support site says:

On the other hand, a 503 Service Temporarily Unavailable error message with "cloudflare-nginx" in it means you are hitting a connection limit in a Cloudflare datacenter. Please contact Cloudflare support with the following information: link

If the site works with browsers, it might be allowing connections only from browsers to save bandwidth, but I think that the connection limit is reached when your server contacts the site, so it doesn't depend on your server.

You can still try using curl to emulate a normal browser and see if it works.

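    <?php
    $url = "https://example.com";
    // Present the request as coming from a (very old) desktop browser.
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';

    $ch = curl_init();
    // Note: this turns off TLS certificate verification.
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    // Return the response as a string instead of printing it.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    $result = curl_exec($ch); // response body, or false on failure
    var_dump($result);
    ?>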

However, curl does not run JavaScript, and the site may notice that.
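Because the response may turn out to be an HTML page rather than the file, it can help to sanity-check the body before parsing it. A rough sketch, where `looksLikeCsv` is a hypothetical helper and the expected column count is something you would know from the real file:

```php
<?php
// Hypothetical check: does the response body look like the expected CSV
// rather than an HTML page? $expectedColumns is an assumption about the
// real file's header row.
function looksLikeCsv(string $body, int $expectedColumns): bool
{
    $trimmed = ltrim($body);
    // An empty body or one starting with a tag is treated as not CSV.
    if ($trimmed === '' || $trimmed[0] === '<') {
        return false;
    }

    // Parse only the first line and compare its number of fields.
    $firstLine = strtok($trimmed, "\r\n");
    $fields = str_getcsv($firstLine, ',', '"', '\\');

    return count($fields) === $expectedColumns;
}
```

This is only a heuristic: it rejects obvious HTML and mismatched headers, but it does not prove the data is complete or current.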

So when attempting this it var dumps out Cloud Flares redirect message, so it checks the browser and then attempts to redirect, you then get a 404, I assume because of a mismatch in the user. — bigdaveygeorge, Dec 10 '17 at 11:42
