41

I am trying to make a site scraper. I made it on my local machine and it works fine there. When I run the same script on my server, it shows a 403 Forbidden error. I am using the PHP Simple HTML DOM Parser. The error I get on the server is this:

Warning: file_get_contents(http://example.com/viewProperty.html?id=7715888) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden in /home/scraping/simple_html_dom.php on line 40

The line of code triggering it is:

$url="http://www.example.com/viewProperty.html?id=".$id;

$html=file_get_html($url);

I have checked the php.ini on the server and allow_url_fopen is On. A possible solution could be to use cURL, but I need to know where I am going wrong.

absk
  • 643
  • 1
  • 6
  • 15
  • Is the server yours? If so, it appears that you or your hosting service have configured security settings to prevent it from being scraped. – Álvaro González Dec 28 '10 at 11:48
  • It's not 'my' server, but it's a dedicated server. – absk Dec 28 '10 at 11:51
  • I misread the question. I thought you were scraping your own site (i.e., a site you have explicit permission to scrape). @Pekka has it right. – Álvaro González Dec 28 '10 at 11:59
  • Voting to close as too broad. If the server returns 403 it means the request is forbidden. If it's not your server, there's no way to know anything beyond that. – miken32 Aug 20 '19 at 21:34

11 Answers

85

I know this is quite an old thread, but I thought I'd share some ideas.

Most likely, if you don't get any content when accessing a webpage, the site doesn't want you to be able to get it. So how does it identify that a script, rather than a human, is trying to access the page? Generally, by the User-Agent header in the HTTP request sent to the server.

So, to make the website think that the script accessing it is also a human, you must change the User-Agent header in the request. Most web servers will likely allow your request if you set the User-Agent header to a value used by a common web browser.

Some common user agents used by browsers are listed below:

  • Chrome: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36

  • Firefox: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0

  • etc.


// Pretend to be a regular browser by overriding the default User-Agent header.
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

echo file_get_contents("https://www.google.com", false, $context);

This piece of code fakes the user agent and sends the request to https://www.google.com.

Cheers!

Ikari
  • 3,176
  • 3
  • 29
  • 34
  • 2
    Either `"header" => "User-Agent: "` or `"user_agent" => ""` [would do](http://php.net/manual/en/context.http.php). – Sz. Mar 27 '18 at 11:06
22

This is not a problem with your script, but with the resource you are requesting. The web server is returning the "forbidden" status code.

It could be that it blocks PHP scripts to prevent scraping, or that it blocks your IP because you have made too many requests.

You should probably talk to the administrator of the remote server.
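If you want to see exactly what the remote server is saying, here is a small sketch (using the URL from the question as a placeholder) that captures the 403 response instead of letting file_get_contents() bail out with a warning:

// ignore_errors makes the wrapper return the body even on 4xx/5xx responses.
$context = stream_context_create(
    array(
        "http" => array("ignore_errors" => true)
    )
);

$body = file_get_contents("http://www.example.com/viewProperty.html?id=7715888", false, $context);
print_r($http_response_header); // status line and response headers from the remote server
echo $body;                     // whatever error page the remote server returned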

Pekka
  • 442,112
  • 142
  • 972
  • 1,088
  • 2
    But at the same time, it works just fine on my localhost. The problem seems to be with my server config somehow. – absk Dec 28 '10 at 11:49
  • 4
    @absk no, the `403 forbidden` is clearly from the remote server. The connection works fine - try a different IP to verify. It could be that your server's IP is blocked on the remote server's end – Pekka Dec 28 '10 at 11:50
11

Add this after you include simple_html_dom.php:

ini_set('user_agent', 'My-Application/2.5');
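A minimal usage sketch (the include path and URL are placeholders): the ini_set() call just has to run before file_get_html(), because the parser fetches pages through PHP's fopen wrappers, which read this setting.

include 'simple_html_dom.php';

// Any non-empty string works; a browser-like value is less likely to be rejected.
ini_set('user_agent', 'My-Application/2.5');

$html = file_get_html('http://www.example.com/viewProperty.html?id=7715888');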
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
Vijay Richards
  • 111
  • 1
  • 8
6

You can change it like this in the parser class, from line 35 onward.

// Fetch the page with cURL instead of file_get_contents().
function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

// Replace the parser's file_get_html() so it loads the HTML via cURL.
function file_get_html()
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
}

Have you tried another site?

Dejan Marjanović
  • 19,244
  • 7
  • 52
  • 66
  • 1
    But how is this supposed to fix a remote 403? – Pekka Dec 28 '10 at 11:52
  • He mentioned cURL, so the first part of the answer was for that, and the second part was "Have you tried another site?", or he might give us a link to check. I know the 403 is remote; that is why I am suggesting he try another site. – Dejan Marjanović Dec 28 '10 at 11:53
  • So it's fetching data from other sites. It seems my IP just got blacklisted. Any way through? – absk Dec 28 '10 at 11:58
  • You can buy another IP, or scrape data from a shared host to make it less obvious, but they can block other IPs as well. First try from another server, and pause between requests when you scrape to look more like a normal user (see the sketch below). – Dejan Marjanović Dec 28 '10 at 12:05
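A sketch of that kind of pause, assuming you loop over property IDs as in the question ($ids, the delay range, and the URL are arbitrary):

foreach ($ids as $id) {
    $html = file_get_html("http://www.example.com/viewProperty.html?id=" . $id);
    // ... process $html ...
    $html->clear(); // free the DOM before the next iteration

    // Random pause between requests so the traffic looks less like a script.
    sleep(rand(2, 6));
}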
5

It seems that the remote server has some kind of blocking in place. It may be blocking by user agent; if that's the case, you can try using cURL to simulate a web browser's user agent like this:

$url="http://www.example.com/viewProperty.html?id=".$id;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$html = curl_exec($ch);
curl_close($ch);
Sergi
  • 1,224
  • 3
  • 15
  • 34
3

Write this in simple_html_dom.php; for me it worked:

function curl_get_contents($url)
{
  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
  $data = curl_exec($ch);
  curl_close($ch);
  return $data;
}

function file_get_html($url, $use_include_path = false, $context = null, $offset = -1, $maxLen = -1, $lowercase = true, $forceTagsClosed = true, $target_charset = DEFAULT_TARGET_CHARSET, $stripRN = true, $defaultBRText = DEFAULT_BR_TEXT, $defaultSpanText = DEFAULT_SPAN_TEXT)
{
  $dom = new simple_html_dom;
  $args = func_get_args();
  $dom->load(call_user_func_array('curl_get_contents', $args), true);
  return $dom;
  //$dom = new simple_html_dom(null, $lowercase, $forceTagsClosed, $target_charset, $stripRN, $defaultBRText, $defaultSpanText);
}
r0adtr1p
  • 61
  • 8
2

I realize this is an old question, but...

I was just setting up my local sandbox on Linux with PHP 7 and ran across this. When you run scripts from the terminal, PHP uses the php.ini for the CLI. I found that the "user_agent" option in that file was commented out. I uncommented it and added a Mozilla user agent, and now it works.
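In the CLI php.ini, that change looks roughly like this (the user agent string is just an example):

; before (the directive ships commented out)
;user_agent="PHP"

; after
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"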

1

Did you check the permissions on the file? I set 777 on my file (on localhost, obviously) and that fixed the problem.

1

You may also need some additional information in the context to make the website believe that the request comes from a human. What I did was open the website in the browser and copy any extra information that was sent in the HTTP request.

// Each header must go on one line, terminated by \r\n. Note that if you send
// Accept-Encoding, the response may arrive compressed and file_get_contents()
// will not decompress it for you.
$context = stream_context_create(
    array(
        "http" => array(
            'method' => "GET",
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36\r\n" .
                        "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3\r\n" .
                        "Accept-Language: es-ES,es;q=0.9,en;q=0.8,it;q=0.7\r\n" .
                        "Accept-Encoding: gzip, deflate, br\r\n"
        )
    )
);
Daniel Renteria
  • 365
  • 2
  • 8
0

In my case, the server was rejecting the HTTP 1.0 protocol via its .htaccess configuration. It seems file_get_contents uses HTTP 1.0 by default.
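If that is the cause, you can ask the stream wrapper to speak HTTP 1.1 instead, roughly like this (a sketch; the URL is a placeholder). Note that with HTTP 1.1 you should also send Connection: close, otherwise the request can hang waiting for the server to close a kept-alive connection:

$context = stream_context_create(
    array(
        "http" => array(
            "protocol_version" => 1.1,
            "header" => "Connection: close\r\n"
        )
    )
);

$html = file_get_contents("http://www.example.com/viewProperty.html?id=7715888", false, $context);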

Steven
  • 1,214
  • 3
  • 18
  • 28
0

If you use file_get_contents, use the code below:

$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);

$html = file_get_contents($url, false, $context); // pass the context as the third argument

If you use cURL:

// The value is just the UA string, without the "User-Agent:" prefix.
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36');
sac
  • 97
  • 11