Web scraping using PHP

Question

I'm trying to crawl some data from a URL with the help of Simple HTML DOM, but when I start my crawler it gives this error:

**failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found**

I have also tried cURL, but it returns a 404 error as well.

Here is my Simple HTML DOM code:

function getURLContent($url)
{
    $html = new simple_html_dom();
    $html->load_file($url);
    /* I perform some operations here */
}

And with cURL:

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$data = curl_exec($curl);
echo $data; 
curl_close($curl);

How can I get this to work?

Thanks in advance.

You're either not using the right URL or the remote site is rejecting your requests because you've been detected as a crawler. — Álvaro González, Nov 13 '13 at 11:11
Fake a browser by sending the correct headers; check this SO [post](http://stackoverflow.com/questions/1926876/can-a-curl-based-http-request-imitate-a-browser-based-request-completely) to give you an idea. — gwillie, Nov 13 '13 at 11:14
Yes, I'm using the correct URL. I even printed the URL, and when I copy and paste it into the browser it works totally fine. — chaitanyasingu, Nov 13 '13 at 11:20
[Debugging PHP `cURL`](http://stackoverflow.com/questions/3757071/php-debugging-curl) — MackieeE, Nov 13 '13 at 11:24

score 0 · Answer 1 · answered Nov 13 '13 at 11:22


Yes, try setting the user agent:

 curl_setopt($curl,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
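Combined with the cURL code from the question, this might look roughly like the sketch below. The function names `browserLikeOptions` and `fetchPage` are hypothetical, and following redirects is an added assumption rather than something the question asked for. Returning the HTTP status code alongside the body makes it easier to see whether the 404 goes away once the user agent is set.

```php
<?php
// Hypothetical helper: builds cURL options that send a browser-like User-Agent.
function browserLikeOptions($url, $userAgent)
{
    return array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_HEADER         => false, // keep response headers out of the body
        CURLOPT_FOLLOWLOCATION => true,  // assumption: follow redirects
        CURLOPT_USERAGENT      => $userAgent,
    );
}

// Hypothetical helper: fetches $url and returns array(HTTP status, body).
// The body is false if the transfer itself failed.
function fetchPage($url, $userAgent)
{
    $curl = curl_init();
    curl_setopt_array($curl, browserLikeOptions($url, $userAgent));
    $body = curl_exec($curl);
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    return array($status, $body);
}
```

A call such as `list($status, $html) = fetchPage($url, $userAgent);` then lets the script check `$status` before handing `$html` to Simple HTML DOM, for example through its `str_get_html()` function.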

vincent kleine


Anil Meena · Answer 2 · 2013-11-13T13:27:38.033

0

Add these to your code and try again:

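// Example request headers only; replace with the headers your browser actually sends.
$headers = array('Accept: text/html,application/xhtml+xml', 'Accept-Language: en-US,en;q=0.8');
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");
curl_setopt($ch, CURLOPT_HEADER, false); // expects a boolean, not the URL; true adds response headers to the output
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers); // set request headers
curl_setopt($ch, CURLOPT_AUTOREFERER, true); // set the Referer header automatically when following redirects
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // skips certificate verification for https URLs; use for testing only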

edited Nov 13 '13 at 13:27

answered Nov 13 '13 at 13:12


score 0 · Answer 3 · edited May 23 '17 at 12:13

A 404 error means what it says: the server reports that the page was not found. Use Fiddler to capture the headers and parameters your real browser sends with the request, then pass the same ones via cURL in your script.
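As a rough sketch of that idea, headers copied from a Fiddler capture can be turned into the list format that `CURLOPT_HTTPHEADER` expects. The helper name `formatHeaders` is hypothetical, and the header values are placeholders, not values taken from the site in question.

```php
<?php
// Hypothetical helper: turns array('Name' => 'value') into the
// array('Name: value') list that CURLOPT_HTTPHEADER expects.
function formatHeaders(array $headers)
{
    $lines = array();
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Placeholder values; replace them with what Fiddler shows for your browser.
$browserHeaders = array(
    'Accept'          => 'text/html,application/xhtml+xml',
    'Accept-Language' => 'en-US,en;q=0.8',
    'Referer'         => 'http://example.com/',
);

$headerLines = formatHeaders($browserHeaders);
// Later, on an existing handle:
// curl_setopt($curl, CURLOPT_HTTPHEADER, $headerLines);
```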

If you get a "blocked" error page instead, try changing the User-Agent, use a proxy address (free proxies are easy to find online), or try maintaining the session across requests. Fiddler will help you with this too.
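For the session and proxy suggestions, a minimal sketch of the relevant cURL options could look like this. The helper name `sessionOptions`, the cookie file path, and the proxy address format are assumptions made for illustration.

```php
<?php
// Hypothetical helper: options for keeping cookies between requests and,
// optionally, routing the request through a proxy.
function sessionOptions($cookieFile, $proxy = null)
{
    $options = array(
        CURLOPT_COOKIEJAR  => $cookieFile, // cookies are saved here when the handle is closed or freed
        CURLOPT_COOKIEFILE => $cookieFile, // cookies are read from here for later requests
    );
    if ($proxy !== null) {
        $options[CURLOPT_PROXY] = $proxy; // assumed format: 'host:port'
    }
    return $options;
}
```

These can be merged into an existing handle with `curl_setopt_array($curl, sessionOptions($cookieFile))`. Reusing the same cookie file across requests is what keeps the session alive.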
