I have a website that tracks individual players' data for an online game. Every day at the same time, a cron job runs that uses cURL to fetch each player's data from the game company's server (each player requires a separate page fetch). Previously I looped through the players, making one cURL request at a time and storing the data. While this was slow, everything worked fine for weeks (anywhere from 500-1,000 players every day).

As we gained more players the cron started to take too long to run, so about a week ago I rewrote it using ParallelCurl (cURL multi handling). It was set to open no more than 10 connections at a time and was running perfectly, doing about 3,000 pages in 3-4 minutes. I never noticed anything wrong until a day or two later, when I was suddenly unable to connect to their servers (every request returned an HTTP code of 0). I thought I was permanently banned/blocked, until about 1-2 hours later I could suddenly connect again. The block occurred several hours after the cron had run for the day; the only requests being made at the time were the occasional single-file requests (which had been working fine, untouched, for months).

The past few days have all been like this: the cron runs fine, then sometime later (a few hours) I can't get a connection for an hour or two. Today I updated the cron to open only 5 connections at a time, and everything worked fine until 5-6 hours later, when I couldn't connect for 2 hours.

I've done a ton of googling and can't seem to find anything useful. My guess is that a firewall is blocking my connection, but I'm really in over my head when it comes to anything like that. I'm clueless as to what is happening and what I need to do to fix it. I'd be grateful for any help, even a guess or just a point in the right direction.

Note that I'm using a shared web host (HostGator). Two days ago I submitted a ticket and made a post on their forums, and I also sent an e-mail to the game company, but I have yet to see a single reply from any of them.

--EDIT--

Here's my code that runs the multiple requests using ParallelCurl. The include has been left untouched and is the same as shown here.

set_time_limit(0);

require('path/to/parallelcurl.php');

$plyrs = array();//normally an array of all the players i need to update

function on_request_done($content, $url, $ch, $player) {
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);    
    if ($httpcode !== 200) {
        echo 'Could Not Find '.$player.'<br />';
        return;
    } else {//player was found, store in db
        echo 'Updated '.$player.'<br />';
    }
}

$max_requests = 5;

$curl_options = array(
    CURLOPT_SSL_VERIFYPEER => FALSE,
    CURLOPT_SSL_VERIFYHOST => FALSE,
    CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
);

$parallel_curl = new ParallelCurl($max_requests, $curl_options);

foreach ($plyrs as $p) {
    $search_url = "http://website.com/".urlencode($p);
    $parallel_curl->startRequest($search_url, 'on_request_done', $p);
    usleep(300); // now that I think about it, does this actually do anything worthwhile positioned here?
}

$parallel_curl->finishAllRequests();

Here's the code I use to simply check whether I can connect or not:

$url = 'http://urlicantgetto.com/'; // the URL I'm testing against

$ch = curl_init();

$options = array(
    CURLOPT_URL            => $url,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HEADER         => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_ENCODING       => "",
    CURLOPT_AUTOREFERER    => true,
    CURLOPT_CONNECTTIMEOUT => 120,
    CURLOPT_TIMEOUT        => 120,
    CURLOPT_MAXREDIRS      => 10,
    CURLOPT_SSL_VERIFYPEER => false,
    CURLOPT_SSL_VERIFYHOST => false,
);
curl_setopt_array( $ch, $options );
$response = curl_exec($ch); 
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);

print_r(curl_getinfo($ch));

if ( $httpCode != 200 ){
    echo "Return code is {$httpCode} \n"
        .curl_error($ch);
} else {
    echo "<pre>".htmlspecialchars($response)."</pre>";
}

curl_close($ch);

Running that when I'm unable to connect results in this:

Array
(
    [url] => http://urlicantgetto.com/
    [content_type] =>
    [http_code] => 0
    [header_size] => 0
    [request_size] => 121
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 0
    [total_time] => 30.073574
    [namelookup_time] => 0.003384
    [connect_time] => 0.025365
    [pretransfer_time] => 0.025466
    [size_upload] => 0
    [size_download] => 0
    [speed_download] => 0
    [speed_upload] => 0
    [download_content_length] => -1
    [upload_content_length] => 0
    [starttransfer_time] => 30.073523
    [redirect_time] => 0
)

Return code is 0
Empty reply from server
Capt Otis

1 Answer

This sounds like a network or firewall issue rather than a PHP/code issue.

Either HostGator is blocking your outbound connections because the spike in outbound traffic could be misinterpreted as a small DoS attack, or the game website is blocking you for the same reason. That would fit with the problem only starting once the number of requests increased, and an HTTP status code of 0 also suggests firewall behaviour.

Alternatively, perhaps the connections aren't closing properly after the cURL requests, and later on, when you try to load that website or download a file, you can't because there are already too many open connections from your server.
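
If that turns out to be the case, one thing worth trying (just a sketch, not a confirmed fix) is telling cURL to close each connection as soon as its request finishes, by adding two options to the array you already pass to ParallelCurl:

$curl_options = array(
    CURLOPT_SSL_VERIFYPEER => FALSE,
    CURLOPT_SSL_VERIFYHOST => FALSE,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.9) Gecko/20071025 Firefox/2.0.0.9',
    CURLOPT_FORBID_REUSE   => TRUE,  // close the connection when the request completes
    CURLOPT_FRESH_CONNECT  => TRUE,  // always open a new connection rather than reusing one
);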

If you have SSH access to your server I might be able to help debug whether it's the open-connections problem; otherwise you'll need to speak to HostGator and the game website owners to see if either party is blocking you.

Another solution might be to scrape the game website more slowly (introduce a wait time between requests) to avoid being flagged for high network traffic.
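
Something along these lines in your request loop might work (a sketch based on the code you posted; note that usleep() takes microseconds, so 500 milliseconds is 500000):

foreach ($plyrs as $p) {
    $search_url = "http://website.com/".urlencode($p);
    $parallel_curl->startRequest($search_url, 'on_request_done', $p);
    usleep(500000); // wait 500ms before starting the next request
}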

Jon
  • I finally got in touch with HostGator and while they weren't much help, they claimed that nothing was being triggered on their end to prevent the connection. Sadly the game company is notorious for their horrible customer service and I honestly doubt I'll ever get a reply no matter how many emails I send (going to keep trying though). I get that they might be blocking me as a potential DDoS attack, but would it make sense for the block to occur literally hours after the requests were made? I'm also curious how long you think I should sleep between requests. 300ms? 1000ms? Thanks. – Capt Otis May 24 '13 at 11:16
  • Yeah the delay doesn't make much sense... just to confirm, you can't connect to the game website from HostGator right? It's not your home/office location that you can't connect from? I'd probably sleep for 500ms, maybe 250ms, but the longer you sleep for the less likely you'll get flagged. – Jon May 24 '13 at 15:42
  • Yes, it's the server that can't connect. 4.5 hours ago I updated the cron with a 300ms sleep between each handle. I ran it once and was able to connect with no problems until literally 5 minutes ago - now I can't make a connection again. – Capt Otis May 24 '13 at 15:48
  • What's the timeout on making the connection? You might be able to use `curl_getinfo` to find out more info. – Jon May 24 '13 at 16:02
  • Updated the original post with all the code I'm using that deals with the connection, and included a full printout of curl_getinfo. I'm not currently setting a default timeout, as I have yet to see a single ParallelCurl script use one - could that possibly be my problem? Edit: just became unbanned again, I think almost exactly 1 hour after it started. – Capt Otis May 24 '13 at 16:37
  • Also I now think that the sleep I added earlier actually doesn't do anything, where would be a better place for it? – Capt Otis May 24 '13 at 16:42
  • The `usleep(300)` will only sleep for 0.0003 seconds (300 microseconds); it'll need to be `usleep(300000)` (300,000 microseconds) to sleep for 300 milliseconds. A timeout might be helpful, but that output you pasted doesn't have a timeout error message, so it definitely seems like you're being blocked somehow. – Jon May 29 '13 at 12:43