2

For more than two years I have had a paid agreement with another website that lets me fetch their content with my script using Simple_html_DOM. Now, suddenly and without any warning, while I am still under contract with them, I'm getting "failed to open stream: Connection timed out" no matter what I use: simple_html_dom, cURL, or file_get_contents. I even tried the Snoopy library to simulate a web browser and still get "Connection timed out". They are somehow blocking connections. It's not IP blocking either, as I tried from several different servers with the same results. Their website loads fine in my web browser, so there is no problem there. Is there any other way I could get content from that website? I paid money for it, and they are blatantly ignoring me after taking my money.

DadaB
    So you are receiving a "Connection timed out" in all of the cases you tried, with no response at all? Web hosts often perform header checks to see whether the accessing client is a web browser, e.g. by checking the User-Agent. The best approach would be to mimic all the request headers your browser sends in a cURL request and see if that still results in a timeout. Also make sure that it's not something on your side, e.g. that you use a proxy in your web browser but none in your programmatic tests. – SaschaM78 May 16 '19 at 09:36

2 Answers

3

The server is probably blocking requests based on a missing or invalid user agent header (User-Agent:). This header tells the server what the client is: a browser, a bot, a spider, an app, etc.

You can try using cURL to send the same kind of headers the server would expect from a typical browser, using curl_setopt with the CURLOPT_USERAGENT option (see the PHP manual entry for curl_setopt).

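$url = "https://example.com";
// We're going to impersonate Chrome 74 on macOS in this example.
$user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36";
$ch = curl_init();
// This is where we set the option to send the user agent header.
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
// Disables TLS certificate verification; use for debugging only.
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// Write request and response details to STDERR to help diagnose the block.
curl_setopt($ch, CURLOPT_VERBOSE, true);
// Return the response body as a string instead of printing it.
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
// curl_exec() returns false on failure; curl_error() explains why.
if ($result === false) {
    echo "cURL error: " . curl_error($ch);
}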

If that still doesn't work, make sure you don't need cookies or login credentials.
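If cookies do turn out to matter, something along these lines might be a starting point. It is only a sketch: the helper names `format_header_lines` and `fetch_with_cookies`, the cookie file path, and the extra header values are assumptions for illustration, not anything the site is known to require.

```php
<?php
// Hypothetical helper: turns ['Accept' => 'text/html'] into ['Accept: text/html'],
// the list-of-lines format that CURLOPT_HTTPHEADER expects.
function format_header_lines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Hypothetical helper: fetches $url with browser-like headers and a cookie file.
// Returns the response body, or null if the request failed.
function fetch_with_cookies(string $url, string $cookieFile, array $headers): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        // Give up on the connection attempt after 10 seconds (arbitrary value).
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_HTTPHEADER     => format_header_lines($headers),
        // Read cookies from, and store new cookies in, the same file.
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error: ' . curl_error($ch));
        return null;
    }
    return $body;
}
```

A call could look like `fetch_with_cookies('https://example.com', '/tmp/cookies.txt', ['User-Agent' => $user_agent, 'Accept-Language' => 'en-US,en;q=0.9'])`; requesting the home page first and the target page second lets any session cookie set by the first response be sent with the second.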

chatnoir
    As an alternative (or the next step in case that fails) you can also open the website in your browser, open developer tools, go to the Network tab, right click the main request that loaded the page, click "Copy as cURL" and run it in your terminal. Then delete headers one by one to find out which one they're blocking. And then use that knowledge to replicate minimum necessary requests in PHP. – Rafał G. May 17 '19 at 19:35
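The "delete headers one by one" step from the comment above could also be automated, roughly as sketched here. The function name `minimal_headers` and the `$works` callback are hypothetical; the callback is assumed to send a request with the given headers and return true when the site responds normally, and the full header set is assumed to work to begin with.

```php
<?php
// Hypothetical helper: tries removing each header in turn and keeps only the
// ones whose removal makes the request fail.
// $headers is an associative array of header name => value.
// $works is a caller-supplied callable: fn(array $headers): bool.
function minimal_headers(array $headers, callable $works): array
{
    $needed = $headers;
    foreach (array_keys($headers) as $name) {
        $trial = $needed;
        unset($trial[$name]);
        if ($works($trial)) {
            // The request still succeeds without this header, so drop it.
            $needed = $trial;
        }
    }
    return $needed;
}
```

In practice `$works` would wrap a cURL call that passes the headers through CURLOPT_HTTPHEADER and checks that curl_exec() did not return false. Each trial costs one real request, so it is worth pausing between attempts.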
1

If you want to use file_get_contents() instead of cURL, you can do this:

// Send a browser-like User-Agent through an HTTP stream context.
$options  = array('http' => array('user_agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'));
$context  = stream_context_create($options);
// file_get_contents() returns false if the request fails.
$response = file_get_contents('http://domain/path/to/uri', false, $context);
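If the user agent alone is not enough, the same stream context can carry additional headers and a timeout. The following is a sketch under assumptions: the helper name `build_http_context`, the example header lines, and the 10-second default are illustrative and not taken from the site in question.

```php
<?php
// Hypothetical helper: builds a stream context for file_get_contents()
// with a User-Agent, extra header lines, and a read timeout in seconds.
function build_http_context(string $userAgent, array $headerLines, float $timeout = 10.0)
{
    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $userAgent,
            // Each entry is a full header line such as 'Accept: text/html'.
            'header'        => $headerLines,
            'timeout'       => $timeout,
            // Return the body even for 4xx/5xx responses so it can be inspected.
            'ignore_errors' => true,
        ],
    ]);
}
```

It would be used as `file_get_contents('http://domain/path/to/uri', false, build_http_context($ua, ['Accept: text/html', 'Accept-Language: en-US,en;q=0.9']))`, where `$ua` holds the user agent string from above.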
RyanNerd