I'm trying to crawl some data from a URL with the help of Simple HTML DOM, but when I start my crawler it gives this error:

**failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found**

I have also tried cURL, but it returns a 404 error as well.

Here is my PHP Simple HTML DOM code:

function getURLContent($url)
{
    $html = new simple_html_dom();
    $html->load_file($url);
    /* I perform some operations here */
}

And with cURL:

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$data = curl_exec($curl);
echo $data; 
curl_close($curl);

How can I fix this?

Thanks in advance.

Boann
chaitanyasingu

3 Answers

Yes, try setting the user agent:

 curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
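
To see whether the User-Agent actually changes the server's response, it can help to read the HTTP status code instead of just echoing the body. Below is a minimal sketch, not a definitive implementation. The function names `fetchWithUserAgent` and `isSuccessStatus` are hypothetical and not part of the question's code.

```php
<?php
// Hypothetical helper: fetch a URL with a browser-like User-Agent
// and return [HTTP status code, response body].
function fetchWithUserAgent(string $url, string $userAgent): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    $body = curl_exec($curl);
    // The status code tells you whether the server still answers 404.
    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    // The handle is released when $curl goes out of scope.
    return [$status, $body === false ? '' : $body];
}

// Hypothetical helper: only 2xx responses are worth parsing.
function isSuccessStatus(int $status): bool
{
    return $status >= 200 && $status < 300;
}

// Usage (assumes $url holds the page you want to crawl):
// [$status, $body] = fetchWithUserAgent($url, 'Mozilla/5.0 ...');
// if (!isSuccessStatus($status)) { echo "Server returned $status"; }
```

If the status is still 404 with a browser-like User-Agent, the URL itself is probably wrong or the server expects something else, such as cookies or extra headers.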
vincent kleine

Add these options to your code and try again:

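// $ch is your cURL handle ($curl in the question's code).
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
curl_setopt($ch, CURLOPT_HEADER, false); // boolean: true would include the response headers in the output
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // $headers must be an array of "Name: value" strings that you define
curl_setopt($ch, CURLOPT_AUTOREFERER, true); // only takes effect when CURLOPT_FOLLOWLOCATION is enabled
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables certificate verification for https URLs; use for testing only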
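
The snippet above passes `$headers` to `CURLOPT_HTTPHEADER` without defining it. That option expects an array of `"Name: value"` strings, so one way to build it might look like the sketch below. The helper name `buildHeaderLines` and the header values are illustrative assumptions; copy the real values from what your browser sends.

```php
<?php
// Hypothetical helper: turn ['Name' => 'value'] pairs into the
// list of "Name: value" strings that CURLOPT_HTTPHEADER expects.
function buildHeaderLines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Illustrative values only.
$headers = buildHeaderLines([
    'Accept' => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.5',
]);

// Then, with your own handle:
// curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
```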
Anil Meena

A 404 error simply means the page was not found at the URL your script requested. Use Fiddler to capture the request parameters your browser actually sends, then pass the same parameters via cURL in your script.

If you get a "blocked" error page instead, try changing the User-Agent, using a proxy address (free proxies are easy to find online), or maintaining the session across requests. Fiddler will help you with this as well.
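
As a rough sketch of the session and proxy suggestions, cURL can keep cookies between requests through a cookie file and can route traffic through a proxy. The helper name `sessionOptions`, the cookie file path, and the proxy address below are assumptions for illustration only.

```php
<?php
// Hypothetical helper: collect cURL options that keep a session
// (via a cookie file) and optionally route through a proxy.
function sessionOptions(string $cookieFile, ?string $proxy = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are saved here when the handle is closed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here on later requests
    ];
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy; // "host:port" of the proxy
    }
    return $options;
}

// Usage (path and proxy address are placeholders):
// $curl = curl_init($url);
// curl_setopt_array($curl, sessionOptions('/tmp/cookies.txt', '127.0.0.1:8080'));
// $data = curl_exec($curl);
```

Reusing the same cookie file for every request in a crawl is what lets the server treat them as one session.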

Yogesh Unavane