
I was trying to read a page from the same site using PHP. I came across this good discussion and decided to use the cURL method suggested:

function get_web_page( $url )
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,     // return web page
        CURLOPT_HEADER         => false,    // don't return headers
        CURLOPT_FOLLOWLOCATION => true,     // follow redirects
        CURLOPT_ENCODING       => "",       // handle all encodings
        CURLOPT_AUTOREFERER    => true,     // set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
        CURLOPT_TIMEOUT        => 120,      // timeout on response
        CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    );

    $ch      = curl_init( $url );
    curl_setopt_array( $ch, $options );
    $content = curl_exec( $ch );
    $err     = curl_errno( $ch );
    $errmsg  = curl_error( $ch );
    $header  = curl_getinfo( $ch );
    curl_close( $ch );

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = $content;
    return $header;
}

//Now get the webpage
$data = get_web_page( "https://www.google.com/" );

//Display the data (optional)
echo "<pre>" . $data['content'] . "</pre>";

So, for my case, I called `get_web_page` like this:

$target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";           
$page = get_web_page($target_url);

The thing that I couldn't fathom is that it worked on all of my test servers but one. I've verified that cURL is available on the server in question. Also, setting `$target_url = "http://www.google.com"` worked fine. So, I am pretty positive that the culprit has nothing to do with the cURL library.
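Since `get_web_page()` folds cURL's diagnostics into the array it returns, one way to narrow a failure like this down is to inspect those fields instead of only echoing the body. A sketch, reusing the `$target_url` from above (`http_code` and `url` are standard `curl_getinfo()` keys):

```php
// Inspect the diagnostic fields get_web_page() already returns.
$page = get_web_page($target_url);

if ($page['errno'] !== 0) {
    // cURL-level failure: DNS lookup, connect timeout, SSL, ...
    echo "cURL error {$page['errno']}: {$page['errmsg']}\n";
} elseif ($page['http_code'] >= 400) {
    // The server answered, but with an error status.
    echo "HTTP status {$page['http_code']} for {$page['url']}\n";
} else {
    echo $page['content'];
}
```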

Can it be because some servers block themselves from being "crawled" by this type of script? Or, maybe I just missed something here?

Thanks beforehand.

  • Did you get any error messages or unexpected output? – Jack Maney Dec 20 '11 at 09:38
  • @Jack Maney: Nope, my script / page just keeps "trying" (the browser looks busy loading something) until timed out. – moey Dec 20 '11 at 09:44

3 Answers

$target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";

I'm not sure the above expression actually returns the correct URL for you;
that might be the cause of the whole problem.

Can it be because some servers block themselves from being "crawled" by this type of script?

Yes, it could be.
But I don't have the answer, because you didn't include the implementation details.
This is your site, so you should be able to check.

In general, I would say this is a bad idea:
if you are trying to access another page on the same domain,
you can simply do `file_get_contents(PATH_TO_FILE.'/press-release/index.html');`
(judging by the `.html` extension, I assume it is a static page).

If that page requires some PHP processing,
you just need to prepare all the necessary variables and then `require` the file.
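A sketch of that local-include approach (the paths and the `$category` variable are hypothetical):

```php
// Static page on the same server: read it straight from disk,
// skipping the HTTP round-trip entirely.
$html = file_get_contents(__DIR__ . '/press-release/index.html');

// Page that needs PHP processing: prepare the variables the script
// expects, then pull it in directly.
$category = 'press-release';   // hypothetical variable the included page uses
require __DIR__ . '/press-release/index.php';
```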

ajreal
  • Thanks for the input (+1)! The page that I was trying to include can be either static or dynamic. The latter is actually a page "hosted" on a WordPress blog (same server, though), e.g. `http:///blog/category/`. So, I need a way to trigger that page as if it were viewed by a browser; hence, the `cURL` library. – moey Dec 20 '11 at 10:21
  • What options do we have to check whether the site is in fact "crawl-able"? – moey Dec 20 '11 at 10:34
  • There is no option for you to check until you really curl the page. However, you can set a timeout http://www.php.net/manual/en/function.curl-setopt.php – ajreal Dec 20 '11 at 10:45
  • Running `curl http:///press-release/index.html` from the command line works. So, it's confirmed that the page indeed can be curl-ed. – moey Dec 20 '11 at 14:38
  • Yes, unless with some weird configuration – ajreal Dec 20 '11 at 14:47

Try using HTTP_HOST instead of SERVER_NAME. They're not quite the same.
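For illustration: `SERVER_NAME` comes from the server configuration (e.g. Apache's `ServerName` directive, depending on `UseCanonicalName`), while `HTTP_HOST` echoes the `Host:` header the client sent, so on a misconfigured virtual host the two can build different URLs:

```php
// These can differ: SERVER_NAME reflects the server config,
// HTTP_HOST reflects the request's Host: header.
$byServerName = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";
$byHttpHost   = "http://" . $_SERVER['HTTP_HOST']   . "/press-release/index.html";

// Log both while debugging to see whether they diverge.
error_log("SERVER_NAME URL: $byServerName");
error_log("HTTP_HOST URL:   $byHttpHost");
```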

a sad dude

It turned out that there's nothing wrong with the above script. And yes, `$target_url = "http://" . $_SERVER['SERVER_NAME'] . "/press-release/index.html";` returned the intended value (as questioned by @ajreal in his answer).

The problem was actually due to how the IP address of the target page was being resolved, which means the answer to this question is related to neither PHP nor Apache: when I ran the script on the server under test, the returned IP address wasn't accessible. Please refer to this more detailed explanation / discussion.

One takeaway: first try `curl -v` from the command line; its verbose output might give you useful clues.
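If you want the same kind of verbose trace from within PHP rather than the command line, cURL can write its transfer log — including which IP the hostname resolved to — to a stream via `CURLOPT_VERBOSE` and `CURLOPT_STDERR`. A sketch, reusing the `$target_url` from the question:

```php
// Rough in-PHP equivalent of `curl -v`.
$ch  = curl_init($target_url);
$log = fopen('php://temp', 'w+');    // capture the verbose chatter in memory
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_VERBOSE        => true,
    CURLOPT_STDERR         => $log,  // send the trace to our stream
    CURLOPT_CONNECTTIMEOUT => 10,
));
curl_exec($ch);
curl_close($ch);

rewind($log);
echo stream_get_contents($log);      // shows DNS resolution, connect attempts, headers
fclose($log);
```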

moey