I'm trying to fetch a specific webpage using PHP's file_get_contents. When I view the page directly in a browser there is no problem, but when I try to grab it from PHP I get "failed to open stream: HTTP request failed! HTTP/1.1 403 Forbidden". There's a piece of data that I'm trying to extract from the page.

$ft = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000');

echo $ft;

I've read up on various pages here about using stream_context_create, mainly the user agent part

$context  = stream_context_create(
array(
    "http" => array(
        "header" => "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    )
)

);

But nothing works and I now get a 400 error message. Unfortunately it doesn't look like my server is configured to use cURL so file_get_contents seems to be the only way for me to do this.

user3713442
  • No. It's called the [Same-origin policy](https://en.wikipedia.org/wiki/Same-origin_policy). – icecub Dec 02 '17 at 19:39
  • Have you tried this: https://stackoverflow.com/questions/2107759/php-file-get-contents-and-setting-request-headers – macghriogair Dec 02 '17 at 19:43
  • @icecub there are plenty of possible reasons for a 403, but essentially the remote server is saying the requestor doesn't have permission to access the resource. However, same-origin is a browser-only thing AFAIK, and wouldn't apply to requests made from a PHP app. In fact the article you linked to makes it quite clear it's a browser-specific thing. I don't think it's related to this issue. – ADyson Dec 02 '17 at 19:45
  • @ADyson I agree there can be other reasons for a 403. But no matter how you look at it, the error clearly states _You're not allowed to do this_. Or _You're not allowed to do this the way you're doing it now_. For me, it always came down to enabling CORS on the server where the request is made and the issue was solved. Hence I always tend to look at that first. – icecub Dec 02 '17 at 19:50
  • @macghriogair - yes I have. – user3713442 Dec 02 '17 at 19:52
  • @icecub but CORS _only_ applies to ajax requests made from the browser. https://developer.mozilla.org/en-US/docs/Web/HTTP/CORS – ADyson Dec 02 '17 at 19:52
  • @ADyson That can be true. I work a lot with Ajax so to be fair, I never realised it only applied to it. Perhaps I'm wrong here. I make mistakes just as well. That's why I make a comment and not an answer. – icecub Dec 02 '17 at 19:55
  • @icecub No problem. I wasn't trying to be critical, just didn't want the OP to go off down the wrong line of enquiry. – ADyson Dec 02 '17 at 19:58
  • If the same-origin policy applied to HTTP requests, wouldn't it mean that it would be impossible for us to navigate to that page in our browsers? – Hassan Dec 02 '17 at 19:58
  • @user3713442 a 403 response means "Forbidden" and simply means the server won't allow you access to the resource at that URL. If you believe you ought to be able to access that URL, then perhaps you need to make your request in a specific way. Check the API maintainer's documentation. – ADyson Dec 02 '17 at 19:59
  • @ADyson - Most servers will need a valid `User-Agent` header for a get request. The above code is not setting such a header and hence the problem. – Cyclonecode Dec 02 '17 at 20:00
  • @Cyclonecode It's potentially the issue, but I think "most" is a bit subjective. Depends on their policy, and whether it's meant to be an API or a browser-based UI. If it's meant to be a browser-based UI page then accessing it via a PHP script probably isn't a great solution to the OP's problem. – ADyson Dec 02 '17 at 20:02
  • I tried to bypass the 403 response by sending a user agent with file_get_contents but to no avail. I tested it beforehand (on netcat) to make sure that the request is properly formatted. It worked with cURL on the terminal, but not with file_get_contents. Did your code work ? Edit : my bad, I was testing a different file_get_contents call this whole time. – Hassan Dec 02 '17 at 20:07
  • @ADyson I know. I rather have you call me out on my mistakes than just leave it at that. Can only learn from it :) – icecub Dec 02 '17 at 20:19

2 Answers

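The `header` option must contain complete header lines, so the user agent string has to be prefixed with the header name `User-Agent:`. Sending the bare string, as in your code, produces a malformed request: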

$context  = stream_context_create(
  array(
    'http' => array(
      'header' => 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    ),
));

You could also use the user_agent option:

$context = stream_context_create(
  array(
    'http' => array(
      'user_agent' => 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
    ),
));

Both of the examples above should work, and you should now be able to get the contents by passing the context as the third argument:

$content = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000', false, $context);

echo $content;
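If it helps to see why a request fails, here is a minimal sketch that wraps the call above and reads the status code from `$http_response_header`, which the http wrapper fills in. The helper names `make_context`, `status_code` and `fetch_page` are hypothetical, and the `ignore_errors` and `timeout` settings are my own choices, not something the site requires.

```php
<?php
// Hypothetical helper: build a stream context that sends a User-Agent header.
function make_context(string $userAgent)
{
    return stream_context_create([
        'http' => [
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // return the body on 4xx/5xx instead of failing
            'timeout'       => 10,   // seconds; an arbitrary choice
        ],
    ]);
}

// Hypothetical helper: return the status code of the last status line,
// so that the final response is used when redirects were followed.
function status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: return the page body, or null unless the server answered 200.
function fetch_page(string $url, string $userAgent): ?string
{
    $body = @file_get_contents($url, false, make_context($userAgent));
    // The http wrapper populates $http_response_header in the calling scope.
    $code = status_code($http_response_header ?? []);
    return ($body !== false && $code === 200) ? $body : null;
}

// Usage (performs a real request, so it is left commented out):
// $html = fetch_page('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000', 'MyApplication/1.0');
```

With `ignore_errors` enabled, a 403 or 400 no longer raises the "failed to open stream" warning, so the status code can be inspected instead.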

This could of course also be tested using curl from the command line. Notice that we are setting our own User-Agent header:

curl --verbose -H 'User-Agent: YourApplication/1.0' 'https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000'

It might also be worth knowing that the default User-Agent used by curl seems to be blocked, so if using curl you need to add your own using the -H flag.
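PHP behaves differently from curl here: its http wrapper sends no User-Agent header at all unless one is configured, because the `user_agent` ini setting is empty by default. As a rough sketch, the setting can also be changed at runtime so that plain `file_get_contents` calls carry a User-Agent without a context. The name `MyApplication/1.0` is only a placeholder.

```php
<?php
// Assumption: 'MyApplication/1.0' is a placeholder application name.
ini_set('user_agent', 'MyApplication/1.0');

// Later http(s) stream requests in this script now send that User-Agent,
// even without a stream context (left commented out, as it hits the network):
// $content = file_get_contents('https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000');

echo ini_get('user_agent'), PHP_EOL;
```

A `user_agent` value given in a stream context still takes precedence for the request that uses that context.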

Cyclonecode
  • This works. However it works now, because you are faking the user agent header to not be recognized as a bot / script. – macghriogair Dec 02 '17 at 20:01
  • @macghriogair - Yes of course this is the reason. But this should work even with a `User-Agent` header such as `User-Agent: MyApplication/1.0`. The request is failing because the server won't allow a request lacking this header. – Cyclonecode Dec 02 '17 at 20:02
  • maybe but "MyApplication/1.0" gives a status 400. so at least they seem to expect some known browser agent string. – macghriogair Dec 02 '17 at 20:05
  • @macghriogair - If you look at an ordinary curl request, curl will by default add a User-Agent: curl/ header. – Cyclonecode Dec 02 '17 at 20:05
  • @macghriogair - It works perfectly for me with a User-Agent: MyApplication/1.0, so I think you made a mistake there =) – Cyclonecode Dec 02 '17 at 20:06
  • @macghriogair - Try how curl does it using the verbose flag: curl --verbose -H 'User-Agent: MyApplication/1.0' 'https://www.vesselfinder.com/vessels/CELEBRITY-MILLENNIUM-IMO-9189419-MMSI-249055000' – Cyclonecode Dec 02 '17 at 20:08
  • @macghriogair - Strange thing is that if you use the default curl User-Agent like `curl/7.5.1` then it will **not** work? In fact using any User-Agent header like `curl/` does **not** seem to work, so this kind of agent must be blocked. – Cyclonecode Dec 02 '17 at 20:12
  • Can confirm this. I suppose the site blocks certain headers such as `curl/*` – macghriogair Dec 02 '17 at 20:14
  • Thanks, this works fine. I'm mystified because I tried a user-agent before and it didn't work, so I guess there's something specific about the string. Anyhow, thanks again. – user3713442 Dec 03 '17 at 20:38

Vesselfinder, the service you are making the request to, seems to deny automated parsing of their data, as @ADyson said. Read the docs: https://www.vesselfinder.com/de/realtime-ais-data#rt-web-services. You may ask them for an API token; it may be part of a paid plan.

They have an official API. You need an API key.

macghriogair