I would like to retrieve the content of a website, but the URL contains an exclamation mark and this doesn't seem to work.

Things I tried:

<?php
// Attempt 1: the raw URL, exclamation mark included
echo file_get_contents('https://domain.com/path/!weird.formatted?url=1');
// Attempt 2: the exclamation mark percent-encoded by hand
echo file_get_contents('https://domain.com/path/%21weird.formatted?url=1');
// Attempt 3: urlencode() over the complete URL
echo file_get_contents(urlencode('https://domain.com/path/!weird.formatted?url=1'));
// Attempt 4: rawurlencode() over the complete URL
echo file_get_contents(rawurlencode('https://domain.com/path/!weird.formatted?url=1'));
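
A side note on the last two attempts: `urlencode()` and `rawurlencode()` over the complete URL also encode the `://` and `?` separators, so the result is no longer a valid URL. A sketch with placeholder values, encoding only the path segment, looks like this; it yields the same `%21` form as the second attempt, so the encoding itself does not appear to be the problem:

<?php
// Sketch with placeholder values: percent-encode only the path segment so the
// scheme, host and query-string separators stay intact.
$base  = 'https://domain.com/path/';
$page  = rawurlencode('!weird.formatted'); // becomes "%21weird.formatted"
$query = '?url=1';

echo file_get_contents($base . $page . $query);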

I also tried to retrieve the content with PHP cURL, but the exclamation mark seems to be a problem there as well.

So how do I retrieve this webpage? Any suggestions would be highly appreciated.

Update

The URL I try to retrieve the content from: https://loket.bunnik.nl/mozard/!suite86.scherm0325?mPag=1070

Arek van Schaijk
  • Does it work when you use the `%21` escape with `file_get_contents()`? – Dom Weldon Jan 05 '16 at 20:40
  • @DomWeldon Look at the examples: no, it does not work. Probably because file_get_contents() translates encoded characters back to normal. – Arek van Schaijk Jan 05 '16 at 20:41
  • Does it throw an error? In what way does it not work? – Dom Weldon Jan 05 '16 at 20:45
  • Using `urlencode()` within `file_get_contents()` should work. In fact, it is [recommended by the documentation](http://php.net/manual/en/function.file-get-contents.php): *If you're opening a URI with special characters, such as spaces, you need to encode the URI with `urlencode()`.* Check your server's error log(s): what does it say? – Terry Jan 05 '16 at 20:45
  • There are no errors produced. file_get_contents() and PHP's cURL just return an empty string. I can retrieve the content of any other website without exclamation marks inside the URL; only wget returns some kind of error on the command line (as described in the question). – Arek van Schaijk Jan 05 '16 at 20:47
  • What do you mean by **and this doesn't seem to work**? Please share the PHP error produced while using the `file_get_contents()` function. – Peyman Mohamadpour Jan 05 '16 at 20:47
  • There is no error. Even curl_error() does not produce any output. – Arek van Schaijk Jan 05 '16 at 20:48
  • Here is the url so you can try it yourself guys: https://loket.bunnik.nl/mozard/!suite86.scherm0325?mPag=1070 – Arek van Schaijk Jan 05 '16 at 20:49
  • The Bash error is caused by using `"double quotes"`; try `'single quotes'` instead. – Niet the Dark Absol Jan 05 '16 at 20:52
  • @NiettheDarkAbsol You're right, but wget still isn't the solution for me. It should work with cURL and file_get_contents(). – Arek van Schaijk Jan 05 '16 at 20:54
  • I get errors. Either 404 or redirect errors. – Twisty Jan 05 '16 at 20:58
  • I tested it with other sites with exclamation marks inside the URL and it just doesn't work for me when the exclamation mark is inside the path. – Arek van Schaijk Jan 05 '16 at 21:02
  • I think this is an SSL issue. Similar issue: http://stackoverflow.com/questions/2880169/how-to-get-contents-of-site-use-https suggests having OpenSSL Extension enabled. – Twisty Jan 05 '16 at 21:02
  • The error, for anyone who doesn't get it: `Warning: file_get_contents(https://loket.bunnik.nl/mozard/!suite86.scherm0325?mPag=1070): failed to open stream: Redirection limit reached, aborting` – Derek Pollard Jan 05 '16 at 21:07
  • @DerekPollard I was getting the same. This suggests it's not an issue with the URL itself but with how the web server is handling the request. – Twisty Jan 05 '16 at 21:11
  • See my answer below :-) – Derek Pollard Jan 05 '16 at 21:11
  • @DerekPollard I did. A bit overkill, but yes, cURL is the way to go; giving it a UserAgent should be enough (a minimal sketch of that follows these comments). The site may want to create a cookie, but it should not require one. – Twisty Jan 05 '16 at 21:13
  • cURL is always overkill lol. – Derek Pollard Jan 05 '16 at 21:15
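
As the last few comments suggest, supplying a browser-like User-Agent may already be enough. A minimal sketch using a stream context (this assumes the server only checks the User-Agent header and follows a few redirects; the UA string is a placeholder, and if the site also insists on cookies, the cURL-based answer below is the way to go):

<?php
// Sketch: send a browser-like User-Agent through a stream context so
// file_get_contents() is not treated as an anonymous client.
$url = 'https://loket.bunnik.nl/mozard/!suite86.scherm0325?mPag=1070';

$context = stream_context_create([
    'http' => [
        'method'          => 'GET',
        'header'          => "User-Agent: Mozilla/5.0 (compatible; ExampleFetcher/1.0)\r\n", // placeholder UA
        'follow_location' => 1,  // follow redirects
        'max_redirects'   => 10,
    ],
]);

echo file_get_contents($url, false, $context);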

1 Answer

So the problem is that the web page was checking for a valid user agent and cookies. The code I used to fix the issue:

<?php
    echo getPage("https://loket.bunnik.nl/mozard/!suite86.scherm0325?mPag=1070");

    function getPage($url) {
        // Identify as a regular desktop browser, since the site rejects anonymous clients.
        $useragent   = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36';
        $timeout     = 120;
        $dir         = dirname(__FILE__);
        // One cookie jar per visitor, so the site can set and read back its cookies.
        $cookie_file = $dir . '/cookies/' . md5($_SERVER['REMOTE_ADDR']) . '.txt';

        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_ENCODING, "");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
        curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
        curl_setopt($ch, CURLOPT_REFERER, 'http://www.google.com/');

        $content = curl_exec($ch);

        if (curl_errno($ch)) {
            echo 'error:' . curl_error($ch);
            $content = false;
        }

        // Always close the handle before returning, so it is released on both paths.
        curl_close($ch);

        return $content;
    }
?>
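
Note that the `cookies/` directory next to the script has to exist and be writable by the web server user, otherwise the cookie jar cannot be written.
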
Derek Pollard