
In PHP, I'm looking to scrape some URLs with file_get_contents.

For most URLs it works, but it fails for some, like walmart.com and buybuybaby.com.

The source code is quite simple, but is there a trick to fetch those kinds of URLs (walmart.com, ...)?

I tried with file_get_contents and also with cURL, but it's still not working.

Thank you in advance for any help.

$url="http://www.buybuybaby.com/";
$homepage = file_get_contents($url);
echo $homepage;

The error: Warning: file_get_contents(https://www.buybuybaby.com/): failed to open stream: HTTP request failed! HTTP/1.0 400 Bad Request

  • The most common (basic) “check” used to reject requests from bots is to check if the User-Agent header matches that of an actual browser. But if those sites already have that kind of measure in place, then likely they _don’t want you_ to scrape their content in the first place. – CBroe Feb 18 '16 at 14:31
  • @CBroe - that's not what's happening here. The server is simply choking on the request. – pguardiario Feb 19 '16 at 01:34
  • @CBroe: It's easy to test. If you send 'I am a robot' as the user agent you will get a good response. – pguardiario Feb 19 '16 at 22:19
  • thanks for your answers ! – user3392106 Feb 22 '16 at 08:36

1 Answer


You should use cURL instead of file_get_contents. For example:

// Fetch a URL with cURL, optionally sending POST data, a Referer header,
// and a cookie jar file, while presenting a browser-like User-Agent.
function curl_get_content($url, $post = "", $refer = "", $usecookie = false)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);

    // Send a POST request if POST data was supplied.
    if ($post) {
        curl_setopt($curl, CURLOPT_POST, 1);
        curl_setopt($curl, CURLOPT_POSTFIELDS, $post);
    }

    // Set the Referer header if one was supplied.
    if ($refer) {
        curl_setopt($curl, CURLOPT_REFERER, $refer);
    }

    // Follow redirects and identify as a browser.
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/6.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.7) Gecko/20050414 Firefox/1.0.3");

    // Skip SSL certificate checks (convenient for scraping, but insecure).
    curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 0);
    //curl_setopt($curl, CURLOPT_TIMEOUT_MS, 5000);

    // Read and store cookies in the given file, if requested.
    if ($usecookie) {
        curl_setopt($curl, CURLOPT_COOKIEJAR, $usecookie);
        curl_setopt($curl, CURLOPT_COOKIEFILE, $usecookie);
    }

    // Return the response body instead of printing it.
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);

    $html = curl_exec($curl);
    if (curl_error($curl)) {
        echo 'cURL error: ' . curl_error($curl);
    }
    curl_close($curl);
    return $html;
}
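
A minimal usage sketch, assuming the curl_get_content() helper above is in scope (the URL is the one from the question):

$url = "http://www.buybuybaby.com/";
$homepage = curl_get_content($url);
echo $homepage;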

That's because file_get_contents() sends a bare request without header information such as a User-Agent, while cURL lets you build a request that looks like it came from a browser, so sites like Walmart, Amazon, Facebook, etc. won't reject it.
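
For completeness, file_get_contents() can also send headers through a stream context, which addresses the User-Agent check mentioned in the comments. A minimal sketch, in case you prefer to stay with file_get_contents (the User-Agent string here is just an example, not the one from the answer above):

$url = "http://www.buybuybaby.com/";
$context = stream_context_create([
    'http' => [
        // Present a browser-like User-Agent instead of PHP's default.
        'header' => "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)\r\n",
    ],
]);
$homepage = file_get_contents($url, false, $context);
echo $homepage;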

– Kelvin
  • Hmm, it works weirdly for some URLs. For example, I got only special characters with the URL http://www.official.my/freebacklinks.php ... Any idea? – user3392106 Feb 23 '16 at 15:45
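
A likely cause of those "special characters" is a gzip-compressed response body that isn't being decompressed; this is an assumption about that particular URL, not something confirmed in the thread. A minimal sketch of the usual fix, added to the helper above:

// Send an empty Accept-Encoding so cURL offers every encoding it
// supports and transparently decompresses the response body.
curl_setopt($curl, CURLOPT_ENCODING, "");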