I want to crawl an entire website , I have read several threads but I cannot manage to get data in a 2nd level.
That is, I can return the links from a starting page but then I cannot find a way to parse the links and get the content of each link...
The code I use is:
<?php
// SELECT STARTING PAGE
$url = 'http://mydomain.com/';
$html= file_get_contents($url);
// GET ALL THE LINKS OF EACH PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);
// run xpath for the dom
$xPath = new DOMXPath($dom);
// get links from starting page
$elements = $xPath->query("//a/@href");
foreach ($elements as $e) {
echo $e->nodeValue. "<br />";
}
// Parse each page using the extracted links?
?>
Could somebody help me out for the last part with an example?
I will be really much appreciated!
Well , thanx for your answers! I tried some stuff but I Haven't managet to get any results yet - I am new to programming..
Below, you can find 2 of my attempts - the 1st trying to parse the links and in the second trying to replace file_get contents with Curl:
1)
<?php
// GET STARTING PAGE
$url = 'http://www.capoeira.com.gr/';
$html= file_get_contents($url);
//GET ALL THE LINKS FROM STARTING PAGE
// create a dom object
$dom = new DOMDocument();
@$dom->loadHTML($html);
// run xpath for the dom
$xPath = new DOMXPath($dom);
// get specific elements from the sites
$elements = $xPath->query("//a/@href");
//PARSE EACH LINK
foreach($elements as $e) {
$URLS= file_get_contents($e);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$output = $xPath->query("//div[@class='content-entry clearfix']");
echo $output ->nodeValue;
}
?>
For the above code I get Warning: file_get_contents() expects parameter 1 to be string, object given in ../example.php on line 26
2)
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_POST, 1);
curl_setopt($curl, CURLOPT_URL, "http://capoeira.com.gr");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$content= curl_exec($curl);
curl_close($curl);
$dom = new DOMDocument();
@$dom->loadHTML($content);
$xPath = new DOMXPath($dom);
$elements = $xPath->query("//a/@href");
foreach ($elements as $e) {
echo $e->nodeValue. "<br />";
}
?>
I get no results. I tried to echo $content and then I get :
You don't have permission to access / on this server.
Additionally, a 413 Request Entity Too Large error was encountered while trying to use an ErrorDocument to handle the request...
Any ideas please?? :)