
I have a PHP script that connects to a URL through cURL and then does something, depending on the returned HTTP status code:

$ch = curl_init();
$options = array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_URL            => $url,
            CURLOPT_USERAGENT      => "What?!?"
);
curl_setopt_array($ch, $options);
$out = curl_exec($ch);
$code = curl_getinfo($ch)["http_code"];
curl_close($ch);

if ($code == "200") {
    echo "200";
} else {
   echo "not 200";
}

Some web servers are slow to reply, and although the page loads in my browser after a few seconds, my script, when it tries to connect to that server, tells me that it did not receive a positive ("200") reply. So, apparently, the connection initiated by cURL timed out.

But why? I don't set a timeout in my script, and according to other answers on this site the default timeout for cURL is definitely longer than the three or four seconds it takes for the page to load in my browser.

So why does the connection time out, and how can I get it to last longer, if, apparently, it is already set to infinite?
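
For reference, I know I could set explicit timeouts with options like the following (the values here are arbitrary examples; my actual script sets none of these):

$options = array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_URL            => $url,
    CURLOPT_USERAGENT      => "What?!?",
    CURLOPT_CONNECTTIMEOUT => 30,   // max seconds to spend establishing the connection
    CURLOPT_TIMEOUT        => 120   // max seconds for the whole request
);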


Notes:

  • The same URL doesn't always time out. So sometimes cURL can connect.
  • It is not one specific URL that sometimes times out, but different URLs at different times.
  • I'm on a shared server, so I don't have root access to any files.
  • I tried to look at curl_getinfo($ch) and curl_error($ch) – as per @drew010's suggestion in the comments – but both were empty whenever the problem happened (a minimal version of such a check is sketched after these notes).
  • The whole script runs for a little more than one minute. In this time it connects to 300+ URLs successfully. Even when one of the URLs fails, the other connections are successfully made. So the script does not time out.
  • cURL does not time out either, because when I try to connect to a URL backed by a script that sleeps for 59 seconds, cURL successfully connects. So apparently the slowness of the failing URL is not in itself a problem for cURL.
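
A minimal version of such a check, for reference: the error has to be read with curl_errno()/curl_error() before curl_close(), since it belongs to the handle ($options is the array shown above).

$ch = curl_init();
curl_setopt_array($ch, $options);
$out   = curl_exec($ch);
$errno = curl_errno($ch);                        // 0 on success, a CURLE_* code otherwise
$error = curl_error($ch);                        // human-readable message for the last error
$code  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($errno !== 0) {
    echo "cURL error $errno: $error";
}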

Update

Following @Karlos' suggestion in his answer, I used:

CURLOPT_VERBOSE        => 1,
CURLOPT_STDERR         => $curl_log

(using code from this answer) and found the following in $curl_log when a URL failed (URL and IP changed):

* About to connect() to www.somesite.com port 80 (#0)
*   Trying 104.16.37.249... * connected
* Connected to www.somesite.com (104.16.37.249) port 80 (#0)
GET /wp_german/?feed=rss2 HTTP/1.1
User-Agent: myURL
Host: www.somesite.com
Accept: */*

* Recv failure: Connection reset by peer
* Closing connection #0
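
For completeness, $curl_log above is a writable stream handed to CURLOPT_STDERR. One way to capture it looks roughly like this (an in-memory temporary stream is just one possible choice; the linked answer may differ in the details):

$curl_log = fopen('php://temp', 'w+');   // writable temporary stream for the verbose output
$ch = curl_init();
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_URL            => $url,
    CURLOPT_USERAGENT      => "What?!?",
    CURLOPT_VERBOSE        => 1,
    CURLOPT_STDERR         => $curl_log
));
curl_exec($ch);
curl_close($ch);
rewind($curl_log);
echo stream_get_contents($curl_log);     // prints the verbose log shown above
fclose($curl_log);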

So, I have found the why – thank you @Karlos! – and apparently @Axalix was right and it is a network problem. I'll now follow suggestions given on this site for that kind of failure. Thanks to everyone for their help!

  • How long is it taking to time out? Is it a connection timeout or a socket timeout? – Chris Mar 23 '16 at 13:44
  • @Chris This is a script that connects to about 300 URLs. It usually finishes within a minute or so. I wouldn't know how to check what kind of timeout it is. –  Mar 23 '16 at 18:17
  • You should dump `curl_getinfo($ch);` to see what the details of the response are. The response code could be empty if it never attempted to connect or had a problem (other than a timeout) during the request. When all else fails `curl_error($ch);` will return an error message too. – drew010 Mar 23 '16 at 22:25
  • To find out all the times, you should write timestamps to a log file. That way you will be able to find out what the timeout for the failed request is. – Ivan Yarych Mar 27 '16 at 10:32
  • @IvanYarych As I explained in my notes above, cURL does not seem to time out at all! The whole script runs about one minute, connecting to 300+ URLs, and cURL does not time out when connecting to a URL that does not react for 59 seconds, so cURL timing out cannot be the problem, because it fails faster than whatever timeout is set for it. –  Mar 27 '16 at 10:39
  • There could be many reasons, including ones where remote sites just don't want you to scrape them, applying different schemes (limits [including by IP], headers, referrers, cookies, etc.). Do you have any specific URL that fails all the time and that you could share here? I could give it a try. If not, you are probably experiencing a problem with limits, so all you need to do is just slow down your requests. – Axalix Mar 27 '16 at 16:38
  • @Axalix As I wrote in my question, all of the URLs work most of the time, and some of the URLs fail sometimes. For example, a few minutes ago all blogspot.com URLs failed. Now they all work again. At other times it's other URLs that fail. Most of the time, none fail. –  Mar 29 '16 at 16:09
  • @what if there's no pattern, then probably you're just facing a network problem. Could be your provider, DNS, etc. Try the same code in a different network and see if you have the same issues. – Axalix Mar 29 '16 at 16:14

2 Answers


My experience working with curl showed me that sometimes when using the option:

CURLOPT_RETURNTRANSFER => true

the server might not give a successful reply, or at least not within the timeframe that curl has to receive the response and cache it, so that the result can be returned into the variable you assign. In your code:

$out = curl_exec($ch);

In the Stack Overflow question CURLOPT_RETURNTRANSFER set to true doesnt work on hosting server, you can see that the option CURLOPT_RETURNTRANSFER is directly affected by the web server implementation of the requested host.

As you are not explicitly using the response body, and your code relies only on the response headers, a good way to solve this might be to set:

CURLOPT_RETURNTRANSFER => false

and execute the curl call, working only with the response headers.

Once you have the header with the code you are interested in, you could run a PHP script that echoes the curl response and parse it yourself:

<?php
    $url = isset($_GET['url']) ? $_GET['url'] : 'http://www.example.com';
    $ch = curl_init();
    $options = array(
            CURLOPT_RETURNTRANSFER => false,
            CURLOPT_URL            => $url,
            CURLOPT_USERAGENT      => "myURL"
    );
    curl_setopt_array($ch, $options);
    curl_exec($ch);
    curl_close($ch);
?>
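
If all you need is the status code, one possible variant (a sketch only, not tested against your URLs) collects the headers and skips the body entirely; note that some servers handle HEAD requests differently from GET:

<?php
    $url = isset($_GET['url']) ? $_GET['url'] : 'http://www.example.com';
    $headers = array();
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_NOBODY         => true,       // HEAD request: headers only, no body
        CURLOPT_USERAGENT      => "myURL",
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$headers) {
            $headers[] = trim($line);         // collect each raw header line
            return strlen($line);             // cURL expects the consumed length back
        }
    ));
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    echo ($code == 200) ? "200" : "not 200";
?>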

In any case, as a reply to your question of why your request does not get an error, I guess that the option CURLOPT_NOSIGNAL and the different timeout options explained in the curl_setopt PHP manual might get you closer to it.
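
For example, a combination along these lines is sometimes used (the values are arbitrary; the exact behavior of each option is described in the manual):

CURLOPT_NOSIGNAL          => 1,     // stop libcurl from using signals for timeouts (needed for sub-second timeouts)
CURLOPT_CONNECTTIMEOUT_MS => 5000,  // give up connecting after 5 seconds
CURLOPT_TIMEOUT_MS        => 15000  // give up on the whole request after 15 seconds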

In order to dig further, the option CURLOPT_VERBOSE might help you get extra information about the request behavior through STDERR.

Evhz
  • Thank you for your kind help! Found the "why" (see the update to my question), and will now try to work from that understanding. –  Mar 29 '16 at 19:18

The reason may be that your hosting provider is imposing some limits on outgoing connections.

Here is what can be done to make your script more robust:

  1. Create a queue in the DB with all the URLs that need to be fetched.

  2. Run a cron job every minute or every 5 minutes; take a few URLs from the DB and mark them as in progress.

  3. Try to fetch those URLs. Mark every successfully fetched URL as a success in the DB.

  4. Increment the failure count for the unsuccessful ones.

  5. Continue going through the queue until it's empty.

If you implement such a solution, you will be able to process every single URL even under unfavourable conditions.
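
A rough sketch of such a worker, run from cron, might look like this; the url_queue table, its columns, and the DB credentials are made-up names purely for illustration:

<?php
    // Hypothetical cron worker; url_queue (id, url, status, failures) is an illustrative table.
    $db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // Take a few pending URLs and mark them as in progress.
    $rows = $db->query("SELECT id, url FROM url_queue WHERE status = 'pending' LIMIT 20")
               ->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $db->prepare("UPDATE url_queue SET status = 'in_progress' WHERE id = ?")
           ->execute(array($row['id']));

        // Try to fetch the URL and check the status code.
        $ch = curl_init();
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_URL            => $row['url'],
            CURLOPT_USERAGENT      => "myURL"
        ));
        curl_exec($ch);
        $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200;
        curl_close($ch);

        // Mark success, or put the URL back in the queue and count the failure.
        if ($ok) {
            $db->prepare("UPDATE url_queue SET status = 'done' WHERE id = ?")
               ->execute(array($row['id']));
        } else {
            $db->prepare("UPDATE url_queue SET status = 'pending', failures = failures + 1 WHERE id = ?")
               ->execute(array($row['id']));
        }
    }
?>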

Ivan Yarych
  • I'm not sure if you read my question correctly. My script does not fail. It finishes. What fails is a connection to one URL among more than 300. And it is not the last URL. My script keeps running after that one URL fails. What I want is **to understand *why* this URL fails** in my script, when it doesn't fail in the browser and when my script can deal with even longer lags (59 seconds) than this URL (3 seconds). –  Mar 27 '16 at 09:40
  • If it's only a single URL that fails, that really looks strange. You will probably need to debug it to understand. If you can provide it, I could take a look. BTW, did you try to run the same script from a different server? – Ivan Yarych Mar 27 '16 at 09:50
  • It is not always the same URL, and all the URLs that sometimes fail I can successfully fetch at other times. –  Mar 27 '16 at 10:14
  • So my solution will work. The reasons can be different. If it's not failing the same all the time - it's hard to tell anything without actual debugging of the issue. One last guess is to add timeouts between consecutive fetches. Like fetch 20 URLs, wait 30 seconds, continue. – Ivan Yarych Mar 27 '16 at 10:30
  • Yes, your "solution" will work. But so does my script without your solution! Your "solution" does not solve the problem. All it does is try to connect to the failing URL at another time, and I do that already by running my script as a cronjob every hour. What I want is to understand why the URL fails, and your answer does not provide that understanding. Please read the question! It asks: "**Why ... ?**" in its title. –  Mar 27 '16 at 10:35
  • You provide no info and want the answer. Read the first sentence of my answer. Good luck – Ivan Yarych Mar 27 '16 at 10:42
  • Then please explain to me how I can find out if there are such limits. As I already explained a few times to you, cURL can connect to a slow URL. So there is no time limit. It can also connect to the failing URL at another time. So there is no problem with connecting to that URL. So what kind of limit can you envision, and how can I verify this? I read everything you wrote, but it does not apply to my situation! –  Mar 27 '16 at 10:45
  • The logic is simple. There are N connections allowed for T period of time. N and T can be anything. So I am proposing making timeouts (sleep) between consecutive fetches like I mentioned in my last comment: fetch 20 URLs, wait 30 seconds, continue. If that doesn't help - try different URL counts/delays – Ivan Yarych Mar 27 '16 at 11:00
  • That cannot be the problem, because sometimes it is the first URL that fails while the following 300 connect fine. –  Mar 27 '16 at 11:34