29

I have a partner who has created some content for me to scrape.
I can access the page with my browser, but when I try to use file_get_contents, I get a 403 Forbidden.

I've tried using stream_context_create, but that's not helping - it might be because I don't know what should go in there.

1) Is there any way for me to scrape the data?
2) If not, and if my partner is not allowed to configure the server to allow me access, what can I do then?

The code I've tried using:

    $opts = array(
      'http'=>array(
        'user_agent' => 'My company name',
        'method'=>"GET",
        'header'=> implode("\r\n", array(
          'Content-type: text/plain;'
        ))
      )
    );

    $context = stream_context_create($opts);

    //Get header content
    $_header = file_get_contents($partner_url,false, $context);
Steven

4 Answers

43

This is not a problem in your script; it's a feature of your partner's web server security.

It's hard to say exactly what's blocking you; most likely it's some sort of block against scraping. If your partner has access to the web server's configuration, that might help pinpoint the cause.

What you could do is to "fake a web browser" by setting the user-agent headers so that it imitates a standard web browser.

I would recommend cURL for this; good documentation for it is easy to find.

    // create curl resource
    $ch = curl_init();

    // set url
    curl_setopt($ch, CURLOPT_URL, "example.com");

    // return the transfer as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // send a browser-like User-Agent header
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

    // $output contains the response body, or false on failure
    $output = curl_exec($ch);
    if ($output === false) {
        echo 'cURL error: ' . curl_error($ch);
    }

    // close curl resource to free up system resources
    curl_close($ch);
Cleric
  • I was going to suggest cURL also. I've used it as well. You can set your user-agent to whatever you want, so just choose a common one like IE and you'll likely get past this lockout. – TecBrat Jul 27 '12 at 02:47
  • 1
    @clerick, thanks, I will try that. I just have to figure out how to enable `CURL` on my web server, because I get a message saying that `curl_init()` is an unknown function. – Steven Jul 27 '12 at 08:40
  • Good luck, and I think this might help you install cURL: http://stackoverflow.com/questions/1347146/how-to-enable-curl-in-php – Cleric Jul 27 '12 at 09:10
31

    // set the User-Agent first
    ini_set('user_agent', 'Mozilla/4.0 (compatible; MSIE 6.0)');
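
The same idea can also be expressed per request through a stream context, which is what the question was attempting. The following is only a minimal sketch under some assumptions: the helper names `buildBrowserContextOptions` and `parseStatusCode` are hypothetical, the 10-second timeout is an arbitrary choice, and `$partner_url` is the variable from the question. Setting `ignore_errors` makes `file_get_contents()` return the body even on a 4xx/5xx response, so the status line can be inspected rather than only seeing a warning.

```php
<?php
// Hypothetical helper: builds HTTP stream-context options that send a
// browser-like User-Agent header with the request.
function buildBrowserContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            // Return the body even on 4xx/5xx so the response can be inspected.
            'ignore_errors' => true,
            // Assumed value; adjust as needed.
            'timeout'       => 10,
        ],
    ];
}

// Hypothetical helper: extracts the numeric code from a status line such as
// "HTTP/1.1 403 Forbidden"; returns null if the line does not look like one.
function parseStatusCode(string $statusLine): ?int
{
    return preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m) ? (int) $m[1] : null;
}

$userAgent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13';
$context   = stream_context_create(buildBrowserContextOptions($userAgent));

// The lines below perform a real request, so they are left commented out:
// $body   = file_get_contents($partner_url, false, $context);
// $status = parseStatusCode($http_response_header[0] ?? '');
```

If the server still answers 403 with a browser-like User-Agent, the block is probably based on something else (IP address, cookies, referrer), and that needs to be sorted out with the partner.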
Abid Hussain
  • 1
    I already tried that and it didn't work. But might be because I was not using a recognized agent. – Steven Jul 27 '12 at 08:41
  • 2
    I had a forbidden 403 error when calling `file_get_contents()`, and adding this `ini_set` before my call fixed my problem. – RPDeshaies Nov 04 '14 at 18:47
  • 1
    Thanks. I wanted to use curl first, which is installed, enabled and displayed in my phpinfo but does not define the functions, so I ended up using the normal file_get_contents function. Indeed, the GitHub API requires a browser agent. Thanks for your solution. +1 – Simon Nitzsche Jan 22 '17 at 01:46
  • 2
    `ini_set('user_agent','Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');` is what worked for me. – Kris Peeling May 25 '20 at 23:47
1

Also, if for some reason you're requesting an HTTP resource that actually lives on your own server, you can save yourself some trouble by reading the file via its absolute path instead.

Like: /home/sally/statusReport/myhtmlfile.html
instead of
https://example.org/myhtmlfile.html

0

Two things come to mind. First, if you're opening a URI with special characters, such as spaces, you need to encode the URI with urlencode(). Second, a URL can be used as a filename with this function only if the fopen wrappers have been enabled.
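
Both checks can be sketched in a few lines. This is an illustration under assumptions, not a definitive implementation: the helper names `encodeUrlPath` and `urlWrappersEnabled` are hypothetical, the example path is made up, and `rawurlencode()` is swapped in for `urlencode()` because `urlencode()` turns spaces into `+`, which is meant for query strings rather than path segments.

```php
<?php
// Hypothetical helper: percent-encodes each path segment while leaving the
// "/" separators intact.
function encodeUrlPath(string $path): string
{
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Hypothetical helper: true when the allow_url_fopen setting lets
// file_get_contents() open URLs.
function urlWrappersEnabled(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Made-up path containing spaces, for illustration only.
$url = 'https://example.org' . encodeUrlPath('/status report/my file.html');
// $url is now "https://example.org/status%20report/my%20file.html"

if (!urlWrappersEnabled()) {
    echo "allow_url_fopen is disabled, so file_get_contents() cannot open URLs\n";
}
```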

ARIF MAHMUD RANA