
I'm trying to build a basic web scraper. It works fine for almost any website, but there are some sites I'm unable to scrape. Why is this? Here is my code on a site that works (this site):

<!doctype html>
<html lang="en-US">
  <body>
    <?php
      $url ='http://stackoverflow.com/';
      $output = file_get_contents($url);
      echo $output;
    ?>
  </body>
</html>

When run on my own localhost, this outputs the content of stackoverflow.com into my page. Here is a site this doesn't work for:

<!doctype html>
<html lang="en-US">
  <body>
    <?php
      $url ='https://www.galottery.com/en-us/home.html';
      $output = file_get_contents($url);
      echo $output;
    ?>
  </body>
</html>

Instead of loading the site I get this error:

Warning: file_get_contents(https://www.galottery.com/en-us/home.html): failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in C:\xampp\htdocs\projects\QD\webScraping\index.php on line 6

Why does this work for some sites and not for others? I thought this could be because one is an HTTPS site, but I've tried this code for others like https://google.com and it works just fine.

I'm using XAMPP to run PHP locally.

Lynel Hudson
  • They've quite possibly got something on their server to stop people scraping their sites – andrewsi Jul 15 '16 at 14:18
  • Have you tried to access the webpage from a browser? If you can't access it then you have been blocked from the site – Rafael Shkembi Jul 15 '16 at 14:19
  • The remote site is blocking requests based on some policy which we can't possibly know. Perhaps by the lack of user-agent or similar. By the way, if you're going to use this sort of tactic on a public website be sure to acquire relevant permissions otherwise you may end up in a legal situation – apokryfos Jul 15 '16 at 14:19
  • Could using a different method than this help? – Lynel Hudson Jul 15 '16 at 14:19
  • 403 Forbidden says what it says :) The website does not want your scraper to be there. It can be an htaccess protection, for example. Sometimes you can get past this kind of protection by playing with the user agent (see here for example: http://stackoverflow.com/a/2107792/6347483) – jquiaios Jul 15 '16 at 14:20
  • You should also not be wrapping your code in any `html`; the scraped page will contain all of your `html` and `body` tags – cmorrissey Jul 15 '16 at 14:26

2 Answers


Either they are checking the User-Agent, or they have blocked your IP address.

To send a correct User-Agent header, you can use curl, like this:

$ch = curl_init();
// Return the response body as a string and leave headers out of the output
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Pretend to be a regular browser by sending a browser User-Agent string
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)');
// Follow redirects and set the Referer header automatically when redirected
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);

curl_setopt($ch, CURLOPT_URL, "https://www.galottery.com/en-us/home.html");
$result = curl_exec($ch);
curl_close($ch);

echo $result;

However, they may also use a JavaScript redirect: first you load the page, the site sets a cookie and performs a document.location.href redirect, and then it checks for that cookie on the next request.
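
If cookies turn out to be the issue, curl can keep them between requests with a cookie jar. This is only a minimal sketch under that assumption (the cookies.txt path is just an example, not something from the site):

$ch = curl_init();
// Reuse one curl handle and a cookie jar so any cookie set by the first
// response is sent back automatically on the second request.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)');
// cookies.txt is an arbitrary example path used to store and re-read cookies
curl_setopt($ch, CURLOPT_COOKIEJAR, __DIR__ . '/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, __DIR__ . '/cookies.txt');

curl_setopt($ch, CURLOPT_URL, 'https://www.galottery.com/en-us/home.html');
curl_exec($ch);           // first request: the site may set its cookie here
$result = curl_exec($ch); // second request: the cookie is sent back
curl_close($ch);

echo $result;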

UPDATE: just tested, my solution works just fine.

spirit
  • It should be noted that many people consider this a rude move. – ceejayoz Jul 15 '16 at 14:43
  • @ceejayoz, really? Why? I'm simulating the behaviour of my own browser, so what's rude about this? – spirit Jul 15 '16 at 14:49
  • Because scraping a site is generally a violation of its terms of service, and the scraping restrictions are usually there for a good reason. – ceejayoz Jul 15 '16 at 14:56
  • I get what you mean, but I think it's only a violation if I'm using the other site for illegal purposes. Don't you think? – spirit Jul 15 '16 at 15:00
  • Great post but I don't have curl. – Lynel Hudson Jul 15 '16 at 15:01
  • @solacyon These days not having curl is basically a misconfigured server. It's all but a requirement for modern web development - anything that talks to APIs is going to expect it. – ceejayoz Jul 15 '16 at 15:06
  • @spirit No, even if it's a legal purpose it can put undue burden on the site. As someone who maintains websites, I'd like it to be my choice what traffic I permit, not yours. – ceejayoz Jul 15 '16 at 15:07
  • @ceejayoz, I get your point and I think I agree with you =). BTW, I'm not doing any web scraping =). – spirit Jul 15 '16 at 15:11

This works:

<?php

// Stream context options: send browser-like request headers with file_get_contents()
$ops = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n" .
                    "Cookie: foo=bar\r\n" .
                    "User-Agent: Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10\r\n"
    )
);

$context = stream_context_create($ops);

// The 'http' context options also apply to https:// URLs
echo file_get_contents('https://www.galottery.com/en-us/home.html', false, $context);
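
If the request still fails, file_get_contents() returns false and the response status line is available in $http_response_header. As an illustration (not part of the original answer), the last line of the snippet above could be replaced with a check like this, reusing the same $context:

// Suppress the warning and inspect the status line ourselves instead
$html = @file_get_contents('https://www.galottery.com/en-us/home.html', false, $context);

if ($html === false) {
    // $http_response_header is set by the http wrapper in the calling scope;
    // its first element is the status line, e.g. "HTTP/1.1 403 Forbidden"
    $status = isset($http_response_header[0]) ? $http_response_header[0] : 'no response';
    echo 'Request failed: ' . $status;
} else {
    echo $html;
}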
S. Denis