
For some reason I can't seem to get this particular web page's contents via cURL. I've managed to fetch the "top level page" contents fine with cURL, but the same self-built quick cURL function doesn't seem to work for one of the linked sub-pages.

Top level page: http://www.deindeal.ch/

A sub page: http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/

My cURL function (in functions.php)

function curl_get($url) {
    $ch = curl_init();
    $header = array(
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Accept-Language: en-us;q=0.8,en;q=0.6'
    );
    $options = array(
        CURLOPT_URL => $url,
        CURLOPT_HEADER => 0,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
        CURLOPT_HTTPHEADER => $header
    );
    curl_setopt_array($ch, $options);
    $return = curl_exec($ch);
    curl_close($ch);

    return $return;
}

PHP file to get the contents (using echo for testing)

require "functions.php";
require "phpQuery.php";

echo curl_get('http://www.deindeal.ch/deals/hotel-walliserhof-zermatt-2-naechte-30/');

So far I've attempted the following to get this to work:

  • Ran the file both locally (XAMPP) and remotely (LAMP).
  • Added in the user-agent and HTTP headers, as recommended in "file_get_contents and CURL can't open a specific website" - before that, curl_get() contained all the options shown above except for `CURLOPT_USERAGENT` and `CURLOPT_HTTPHEADER`.

Is it possible for a website to completely block requests via cURL or other remote file opening mechanisms, regardless of how much data is supplied to attempt to make a real browser request?

Also, is it possible to diagnose why my requests are turning up with nothing?

Any help answering the above two questions, or editing/making suggestions to get the file's contents, even if through a method different from cURL, would be greatly appreciated ;).


1 Answer

Try adding:

CURLOPT_FOLLOWLOCATION => TRUE

to your options.
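
In other words, inside curl_get() the options array would become something like this (just a sketch: `CURLOPT_FOLLOWLOCATION` and `CURLOPT_MAXREDIRS` are standard PHP cURL options, everything else is unchanged from the question):

$options = array(
    CURLOPT_URL            => $url,
    CURLOPT_HEADER         => 0,
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_FOLLOWLOCATION => true, // follow the 302 to the URL in the Location header
    CURLOPT_MAXREDIRS      => 5,    // safety limit in case of a redirect loop
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13',
    CURLOPT_HTTPHEADER     => $header
);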

If you run a simple curl request from the command line (including -i to see the response headers), it is pretty easy to see what is happening:

$ curl -i 'http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/'
HTTP/1.1 302 FOUND
Date: Fri, 30 Dec 2011 02:42:54 GMT
Server: Apache/2.2.16 (Debian)
Vary: Accept-Language,Cookie,Accept-Encoding
Content-Language: de
Set-Cookie: csrftoken=d127d2de73fb3bd72e8986daeca86711; Domain=www.deindeal.ch; Max-Age=31449600; Path=/
Set-Cookie: generic_cookie=1; Path=/
Set-Cookie: sessionid=987b1a11224ecd0e009175470cf7317b; expires=Fri, 27-Jan-2012 02:42:54 GMT; Max-Age=2419200; Path=/
Location: http://www.deindeal.ch/welcome/?deal_slug=hotel-cristal-in-nuernberg-30
Content-Length: 0
Connection: close
Content-Type: text/html; charset=utf-8

As you can see, it returns a 302 with a Location header. If you hit that location directly, you will get the content you are looking for.

And to answer your two questions:

  1. No, it is not possible to block requests from something like cURL. If the consumer can talk HTTP, it can get to anything the browser can get to.
  2. Diagnosing with an HTTP proxy could have been helpful for you. Wireshark, Fiddler, Charles, et al. should help you out in the future. Or do what I did and make a request from the command line. (There is also a small PHP diagnostic sketch right after this list.)
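
If you'd rather check things from PHP itself than with a proxy, something along these lines would at least show the status code, the final URL, and any transport error. This is only a rough sketch: curl_getinfo() and curl_error() are standard PHP cURL functions, but curl_debug() is just a name made up for the example.

function curl_debug($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 1);   // keep the response headers in the output
    $body = curl_exec($ch);
    $info = curl_getinfo($ch);             // http_code, url, redirect_count, ...
    $error = curl_error($ch);              // empty string if the transfer itself worked
    curl_close($ch);

    echo 'HTTP ' . $info['http_code'] . ' from ' . $info['url'] . "\n";
    if ($error !== '') {
        echo 'cURL error: ' . $error . "\n";
    }
    return $body;
}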

EDIT
Ah, I see what you are talking about now. So, when you go to that link for the first time you are redirected and a cookie (or cookies) is set. Once you have those cookies, your request goes through as intended.

So, you need to use a cookiejar, like in this example: http://icfun.blogspot.com/2009/04/php-how-to-use-cookie-jar-with-curl.html

So, you will need to make an initial request, save the cookies, and then include them in your subsequent requests.
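
A rough sketch of that flow (assumptions: `CURLOPT_COOKIEJAR`/`CURLOPT_COOKIEFILE` are the standard PHP cURL options for this, `/tmp/cookies.txt` is just an example path the PHP process must be able to write to, and curl_get_cookies() is a made-up name):

function curl_get_cookies($url, $cookieFile) {
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_FOLLOWLOCATION => true,        // still follow the 302 to /welcome/
        CURLOPT_COOKIEJAR      => $cookieFile, // save received cookies when the handle closes
        CURLOPT_COOKIEFILE     => $cookieFile, // send previously saved cookies with the request
    ));
    $return = curl_exec($ch);
    curl_close($ch);
    return $return;
}

$cookieFile = '/tmp/cookies.txt'; // example path, must be writable

// First request: gets redirected to /welcome/ and stores the csrftoken/sessionid cookies.
curl_get_cookies('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/', $cookieFile);

// Second request: sends those cookies back, so the deal page itself should be returned.
echo curl_get_cookies('http://www.deindeal.ch/deals/hotel-cristal-in-nuernberg-30/', $cookieFile);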

  • Thanks for the info - adding `CURLOPT_FOLLOWLOCATION` did work according to the response headers (redirecting to `http://www.deindeal.ch/welcome/?..`). However, it's now apparent that cURL is sent to a different location than the browser: if I visit the URL in my browser I'm not redirected and the page loads fine, but when the request is made via cURL a different location is returned - do you know why this might be? – Avicinnian Dec 30 '11 at 02:50