
I am trying to scrape a website and I am getting a 403 Forbidden error no matter what I try:

  1. wget
  2. cURL (command line and PHP)
  3. Perl WWW::Mechanize
  4. PhantomJS

I tried all of the above with and without proxies, changing the user agent, and adding a Referer header.

I even copied the request headers from my Chrome browser and sent them with my request using PHP cURL, and I am still getting a 403 Forbidden error.

Any input or suggestions on what is triggering the website to block the request, and how to get around it?

PHP cURL example:

$url = 'https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=1510475982858';
$headers = array(
    'accept:application/json, text/javascript, */*; q=0.01',
    'accept-encoding:gzip, deflate, br',
    'accept-language:en-US,en;q=0.9',
    'referer:https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands:quadblock:supplements',
    'user-agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'x-requested-with:XMLHttpRequest',
);

$res = curl_get($url, $headers);
print $res;
exit;

function curl_get($url, $headers = array(), $useragent = '') {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($curl, CURLOPT_HEADER, true);         // include response headers (stripped off below)
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_ENCODING, '');         // accept any encoding and decompress automatically
    if ($useragent) curl_setopt($curl, CURLOPT_USERAGENT, $useragent);
    if ($headers) curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);

    $response = curl_exec($curl);

    // Split the raw response into headers and body; return only the body.
    $header_size = curl_getinfo($curl, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size);
    $response = substr($response, $header_size);

    curl_close($curl);
    return $response;
}

And here is the response I always get:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access     

  "http&#58;&#47;&#47;www&#46;vitacost&#46;com&#47;productResults&#46;aspx&#63;" 
on this server.<P>
Reference&#32;&#35;18&#46;55f50717&#46;1510477424&#46;2a24bbad
</BODY>
</HTML>
user735247
  • You are setting the user agent the wrong way... you should have used the `CURLOPT_USERAGENT` option – Flash Thunder Nov 12 '17 at 09:11
  • @FlashThunder, the `CURLOPT_USERAGENT` option is there, and it gets set when I pass the $useragent variable. I've tried setting the user agent both ways, via the header and via `CURLOPT_USERAGENT`. I don't think that has anything to do with why this is not working. – user735247 Nov 12 '17 at 22:43

1 Answer


First, note that the site does not want to be scraped. As @KeepCalmAndCarryOn pointed out in a comment, this site has a /robots.txt which explicitly asks bots not to crawl specific parts of the site, including the parts you want to scrape. While not legally binding, a good citizen will adhere to such a request; a quick way to check is sketched below.
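
As a rough illustration (my sketch, not from the original discussion: it ignores User-agent groups, Allow rules, and wildcards, so treat it as a starting point only), checking robots.txt from PHP before fetching a path could look like this:

<?php
// Naive robots.txt check: returns true if $path falls under any
// Disallow rule. Real robots.txt rules are grouped per User-agent
// and may contain wildcards and Allow lines, which this ignores.
function is_disallowed($host, $path) {
    $rules = @file_get_contents("https://$host/robots.txt");
    if ($rules === false) return false; // no robots.txt: nothing is forbidden
    foreach (preg_split('/\r?\n/', $rules) as $line) {
        if (preg_match('/^Disallow:\s*(\S+)/i', $line, $m)
            && strpos($path, $m[1]) === 0) {
            return true;
        }
    }
    return false;
}

// Expected to flag the path from the question, given the Disallow
// rule for productResults mentioned in the comments.
var_dump(is_disallowed('www.vitacost.com', '/productResults.aspx'));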

Additionally, the site seems to employ explicit protection against scraping and tries to make sure that the client is really a browser. The site appears to sit behind the Akamai CDN, so the anti-scraping protection may come from the CDN itself.

I took the request sent by Firefox (which worked) and tried to simplify it as much as possible. The following currently works for me, but may of course fail if the site updates its browser detection:

use strict;
use warnings;
use IO::Socket::SSL;

# Build the raw request; the heredoc uses \n, so normalize line endings to CRLF.
(my $rq = <<'RQ') =~ s{\r?\n}{\r\n}g;
GET /productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40&_=151047598285 HTTP/1.1
Host: www.vitacost.com
Accept: */*
Accept-Language: en-US
Connection: keep-alive

RQ

# Open a TLS connection and send the raw request over the socket.
my $cl = IO::Socket::SSL->new('www.vitacost.com:443') or die;
print $cl $rq;

# Read the response headers up to the empty line that terminates them.
my $hdr = '';
while (<$cl>) {
    $hdr .= $_;
    last if $_ eq "\r\n";
}
warn "[header done]\n";

# Extract the Content-Length and read exactly that many bytes of body.
my $len = $hdr =~ m{^Content-length:\s*(\d+)}mi && $1 or die "no length";
read($cl, my $buf, $len);
print $buf;

Interestingly, if I remove the Accept header I get a 403 Forbidden. If I instead remove the Accept-Language header, the request simply hangs. Also interestingly, it does not seem to need a User-Agent header at all.
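
For comparison, the same minimal request can be attempted from PHP cURL, which is what the question uses. This is an untested sketch of the translation: curl adds its own Host header and uses its own TLS stack, so the bot detection may still treat it differently than the raw-socket Perl version above.

<?php
// Minimal request mirroring the Perl example: only the two headers
// that proved necessary are sent. PHP's cURL sends no User-Agent
// header by default, matching the observation above.
$curl = curl_init('https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723&isrc=vitacostbrands%3aquadblock%3asupplements&scrolling=true&No=40');
curl_setopt_array($curl, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER     => array(
        'Accept: */*',            // removing this yielded 403 Forbidden
        'Accept-Language: en-US', // removing this made the request hang
    ),
));
print curl_exec($curl);
curl_close($curl);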

EDIT: it looks like the bot detection also uses the source IP of the sender as a feature. While the code above works for me from two different systems, it fails from a third system (hosted at DigitalOcean) and just hangs.

Steffen Ullrich
  • If you look at https://www.vitacost.com/robots.txt you will see that they don't want scraping of productResults - this is all advisory, but it's good to be a good citizen: https://en.wikipedia.org/wiki/Robots_exclusion_standard – KeepCalmAndCarryOn Nov 12 '17 at 10:31
  • @KeepCalmAndCarryOn: thanks for the input. I've edited the answer to include the essence of it. – Steffen Ullrich Nov 12 '17 at 10:46
  • @SteffenUllrich, at this point I'm really interested in figuring out why my browser request works while all the other tools I use do not, even though I am sending the same headers. I tried running your code on my server and my personal computer and it just hangs. You're correct that removing the Accept header causes a 403 Forbidden, but I still can't get the server to return a response. Could this have to do with the SSL certificate? – user735247 Nov 12 '17 at 22:40
  • @user735247: This has nothing to do with the certificate. I rather suspect that the source IP of the request is included in the bot detection. It works from the two systems I've tried, but it fails from the third system, which is a DigitalOcean server. If you try the code from the system where you successfully access the site with the browser, it will probably succeed too. – Steffen Ullrich Nov 13 '17 at 05:06
  • About hanging... it has nothing to do with IP... same IP on Debian Lynx - hangs, on Windows Opera - works – Flash Thunder Nov 13 '17 at 10:25
  • @FlashThunder: A hang can have different causes, and the source IP is only one of them. If I use the code from my answer, it works on two machines but hangs on the DigitalOcean machine. If I modify the code and remove the `Accept-Language` header, it also hangs on machines where it worked before. This behavior is also described in my answer. – Steffen Ullrich Nov 13 '17 at 11:04
  • I read a comment somewhere suggesting that the order of the headers might also be a factor. Can someone confirm by trying to switch the order of the request headers? None of the machines I have access to seem to be able to get a response from the server, so I cannot try it. – user735247 Nov 14 '17 at 02:35
  • @user735247: The order might in theory matter. But I've tried some permutations and they don't seem to matter in this specific case against the current bot-detection. – Steffen Ullrich Nov 14 '17 at 05:50
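
A rough way to test the header-order hypothesis yourself (an untested sketch; the URL and header set are taken from the answer above) is to try every permutation of the header list and record the HTTP status each one receives:

<?php
// Try each permutation of the headers against the same URL and
// print the resulting HTTP status code. A timeout is set because
// the observed failure mode is a hang, not an error response.
function permutations(array $items) {
    if (count($items) <= 1) return array($items);
    $result = array();
    foreach ($items as $i => $item) {
        $rest = $items;
        unset($rest[$i]);
        foreach (permutations(array_values($rest)) as $perm) {
            array_unshift($perm, $item);
            $result[] = $perm;
        }
    }
    return $result;
}

$headers = array('Accept: */*', 'Accept-Language: en-US', 'Connection: keep-alive');
foreach (permutations($headers) as $perm) {
    $curl = curl_init('https://www.vitacost.com/productResults.aspx?allCategories=true&N=1318723');
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => $perm,
        CURLOPT_TIMEOUT        => 10,
    ));
    curl_exec($curl);
    printf("%s => %d\n", implode(' | ', $perm), curl_getinfo($curl, CURLINFO_HTTP_CODE));
    curl_close($curl);
}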