
I have code that works:

$html = file_get_contents('https://www.imdb.com/');
echo $html;

This code also works:

$html = file_get_contents('https://www.google.com/');
echo $html;

But it doesn't work with some URLs, like this one:

$html = file_get_contents('https://www.rottentomatoes.com/');
echo $html;

Instead, I get this error:

Warning: file_get_contents(https://www.rottentomatoes.com/tv/friends): failed to open stream: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

I don't understand why this is happening. The website loads in my browser with no problem and no need for a VPN (though some other URLs might need one).

I also tried Simple HTML DOM Parser: with version 1.9 I get the same error, and with version 2.0RC2 I get an empty $html with a NULL value.

Can someone help me, please?

Ramiel
  • Possible [duplicate](https://stackoverflow.com/questions/17363545/file-get-contents-is-not-working-for-some-url) – berend Nov 03 '20 at 10:17
  • @berend It's kind of similar, but their answer is based on luck! – Ramiel Nov 03 '20 at 10:22
  • Web scraping can be a complex task depending on the site and the measures it implements to prevent scraping. Among the possible cases: some sites may need certain cookies to work (you can enable them with cURL), they may check whether the request comes from a browser, allow only certain methods (e.g. POST or GET), or render the page with a JS framework, in which case you would get an incomplete result. Adding details could help people answer you properly – Kaddath Nov 03 '20 at 10:40
  • @Kaddath Thank you for the useful information. Before going further, I wonder why this exact same code works for my friend in another country every time he runs it, but never works for me. Maybe it depends on the internet connection? I don't know – Ramiel Nov 03 '20 at 10:46
  • @Ramiel A good start would be to check the request and response headers and compare them to your friend's, both with the same browser and from the script, and try to find what differs (see the sketch after these comments). Maybe try a Google search with keywords such as "webscraping rottentomatoes" to see if people have already solved this for you. It can take a lot of trial and error, but maybe someone has already done it and shared the knowledge on the web – Kaddath Nov 03 '20 at 10:57
  • @Kaddath Thank you, I'll try that, but there isn't much information about web scraping rottentomatoes – Ramiel Nov 03 '20 at 11:02
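Following the header-comparison suggestion above, here is a minimal sketch for inspecting what the server actually returns to PHP. The browser-like header values are illustrative assumptions, not anything this site is known to require:

<?php
// Send browser-like request headers so the comparison is meaningful.
// These header values are illustrative defaults, not known requirements.
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n" .
                     "Accept: text/html,application/xhtml+xml\r\n",
        'timeout' => 10, // seconds; fail fast instead of hanging
    ],
]);

// get_headers() performs the request and returns the raw response headers,
// which you can compare with what your friend's machine receives.
$headers = get_headers('https://www.rottentomatoes.com/', false, $context);
var_dump($headers);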

1 Answer


You can't use the file_get_contents() function on every website! Currently, https://www.rottentomatoes.com/ is refusing your connection.

Please read more on [how to use file_get_contents](https://www.php.net/manual/en/function.file-get-contents.php).
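A minimal sketch, assuming the site is rejecting PHP's bare default request (no User-Agent): file_get_contents() accepts a stream context, so you can at least send a browser-like header and a shorter timeout. There is no guarantee this gets past the block:

<?php
// A stream context lets file_get_contents() send custom headers.
// The User-Agent string below is an illustrative value.
$context = stream_context_create([
    'http' => [
        'header'  => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n",
        'timeout' => 15, // seconds, instead of the long default wait
    ],
]);

$html = file_get_contents('https://www.rottentomatoes.com/', false, $context);
if ($html === false) {
    exit('Request failed; the site is still not answering this client.');
}
echo $html;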

Burhan Kashour
  • So what about simple-html-dom? I also tested PHP Guzzle. How can I fix this problem? – Ramiel Nov 03 '20 at 10:19
  • It's not about which library you use. Every website can allow or deny connections from any type of script. You can try the [cURL functions](https://www.php.net/manual/en/ref.curl.php) – Burhan Kashour Nov 03 '20 at 10:21
  • I fear that this only very vaguely answers the question (or not at all): there is no explanation of the possible reason why the connection is refused (if that is even true, as the error points to a timeout rather than a refused connection), and the link to the docs doesn't address this problem either – Kaddath Nov 03 '20 at 10:25
  • But this is kind of weird. I have a friend who just copy-pasted my code, and it works for him. How do they not block his request? – Ramiel Nov 03 '20 at 10:27
  • @Kaddath Well, a timeout can mean a lot of things. When I test the function on my side, the site is refusing the connection and not returning any response, so `timeout` here means `rejected`. – Burhan Kashour Nov 03 '20 at 10:27
  • But he is in a different country – Ramiel Nov 03 '20 at 10:28
  • @Ramiel Maybe it will work once, or one time in many, but `file_get_contents` is not the proper way to achieve this. Working with remote URLs is a bit tricky; `cURL` is more useful, as you can control the headers, timeouts, etc. (see the cURL sketch after this thread) – Burhan Kashour Nov 03 '20 at 10:28
  • I'm not downvoting or flagging your answer, just saying that it could be improved ;) – Kaddath Nov 03 '20 at 10:30
  • @Kaddath Well, downvoting or flagging doesn't change the fact that `file_get_contents` isn't always enough and that you should be using `cURL` – Burhan Kashour Nov 03 '20 at 10:31
  • @BurhanKashour I tried everything I knew before asking my question here. I also tested `cURL` and set up the headers and so on, but that didn't work either – Ramiel Nov 03 '20 at 10:32
  • @BurhanKashour I don't really remember; maybe the same error, but I will test again – Ramiel Nov 03 '20 at 10:37
  • @BurhanKashour Yeah, same error, even when I used a user agent – Ramiel Nov 03 '20 at 10:52
  • @Ramiel Then the problem is with the website; as I said, it is refusing any connection from scripts – Burhan Kashour Nov 03 '20 at 10:58
  • @BurhanKashour Is it possible to make this work for these kinds of websites (is it worth working on?), or in the end is there no way? – Ramiel Nov 03 '20 at 11:06
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/224036/discussion-between-burhan-kashour-and-ramiel). – Burhan Kashour Nov 03 '20 at 13:34
  • @BurhanKashour That sounds perfect – Ramiel Nov 03 '20 at 15:56
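For reference, a minimal cURL sketch along the lines discussed in this thread: unlike file_get_contents(), the request's headers, redirects, and timeouts are all adjustable. The header values are illustrative assumptions and may still not be enough for this particular site:

<?php
// Sketch of the cURL approach mentioned in the comments above.
$ch = curl_init('https://www.rottentomatoes.com/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
    CURLOPT_FOLLOWLOCATION => true,  // follow redirects
    CURLOPT_CONNECTTIMEOUT => 10,    // give up quickly if the host stays silent
    CURLOPT_TIMEOUT        => 20,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)', // illustrative value
    CURLOPT_HTTPHEADER     => [
        'Accept: text/html,application/xhtml+xml',
        'Accept-Language: en-US,en;q=0.9',
    ],
]);

$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch);
} else {
    echo $html;
}
curl_close($ch);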