I'm trying to create a small URL crawler for internal use within the company I work for.
Currently, I have a helper class where all the magic happens and an index.php that displays the results.
What I'd like is to give it a URL and have the code go away and fetch every page URL the site contains, then display them on the screen.
However, waiting for the foreach loop to finish takes an age, so I'd like to echo each link after each iteration of the loop.
I can't get it to work, and I don't know whether the problem is the link-fetching code or my attempts to flush the output buffer. I've followed the examples in this question: Echo 'string' while every long loop iteration (flush() not working)
My code is below (without the flushing attempts)
// INDEX.PHP
require_once('helper.php');
$helper = new Helper();
flush();
ob_flush();
$found = $helper->crawlSite('http://www.bbc.co.uk', 'http://www.bbc.uk');
echo count($found);
// HELPER.PHP
class Helper
{
    private $checked = [];
    private $foundUrls = [];

    public function __construct()
    {
    }

    public function getHTML($url)
    {
        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
        $html = curl_exec($curl);
        curl_close($curl);
        return $html;
    }

    public function getTagFromHTML($html, $tag)
    {
        $dom = new DOMDocument();
        $dom->loadHTML($html);
        return $dom->getElementsByTagName($tag);
    }

    function crawlSite($url, $initialUrl)
    {
        $html = $this->getHTML($url);
        $links = $this->getTagFromHTML($html, 'a');
        foreach ($links as $link) {
            echo $link->getAttribute('href') . '<br>';
            flush();
            ob_flush();
            if (!in_array($link->getAttribute('href'), $this->checked)) {
                if (strpos($link->getAttribute('href'), $initialUrl) !== FALSE) {
                    $this->foundUrls[] = $link->getAttribute('href');
                    $this->crawlSite($link->getAttribute('href'), $initialUrl);
                } else {
                    $this->foundUrls[] = $initialUrl . $link->getAttribute('href');
                    $this->crawlSite($initialUrl . $link->getAttribute('href'), $initialUrl);
                }
                $this->checked[] = $link->getAttribute('href');
            } else {
                echo "Already Checked <br>";
                flush();
                ob_flush();
            }
        }
        return $this->foundUrls;
    }
}
Update
I've updated the code to crawl a larger site to demonstrate the problem. I've also included one of my attempts at flushing the output buffer and implemented @Dev Jyoti Behera's suggestion of moving the echo.
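For reference, the echo-then-flush pattern can be isolated roughly as follows. This is only a sketch: `emitLine` is a hypothetical helper that is not part of the Helper class above, and the example.com URLs are placeholders. It pushes PHP's own output buffer with `ob_flush()` (only when one is active, since calling it with no buffer raises a notice) and then calls `flush()`. Whether the browser actually receives each line immediately also depends on server-side buffering or compression, which this code does not control.

```php
<?php
// Hypothetical helper: echo one line and push it towards the client.
function emitLine(string $text): void
{
    echo $text, "<br>\n";

    // ob_flush() raises a notice when no output buffer is active,
    // so only call it if PHP is currently buffering.
    if (ob_get_level() > 0) {
        ob_flush();
    }

    // Ask PHP to flush its system-level write buffers as well.
    flush();
}

// Placeholder URLs standing in for links found during a crawl.
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    emitLine($url);
}
```

Wrapping the two calls in one helper keeps them in a consistent order everywhere the crawler prints something.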
Update 2
Thanks to the suggestion (as mentioned above), I can now see live text being printed on the screen. I now have a second problem, however: the crawler seems to ignore the "already checked" if statement, so it checks and lists the same URL over and over. /sigh - I love programming, honestly.
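One thing visible in the code above is that a link is only added to `$checked` after the recursive `crawlSite()` call for it has returned, so the deeper calls never see it as checked. A minimal sketch of the alternative, marking a URL as visited before following its links, is below. It is an illustration rather than a drop-in fix: the `$pages` array is a hypothetical in-memory stand-in for `getHTML()` and `getTagFromHTML()` so the example runs without network access, and `crawl` and `$visited` are names invented here.

```php
<?php
// Hypothetical in-memory "site": page URL => links found on that page.
// It replaces the cURL/DOM fetching so the sketch runs offline.
$pages = [
    '/'      => ['/about', '/news'],
    '/about' => ['/', '/news'],
    '/news'  => ['/', '/about'],
];

// Depth-first crawl that records a URL as visited BEFORE following its links.
function crawl(string $url, array $pages, array &$visited): void
{
    if (isset($visited[$url])) {
        return; // already seen, do not crawl it again
    }

    // Mark first, so recursive calls that link back here stop immediately.
    $visited[$url] = true;

    foreach ($pages[$url] ?? [] as $link) {
        crawl($link, $pages, $visited);
    }
}

$visited = [];
crawl('/', $pages, $visited);

// The array keys are the distinct URLs, in the order they were first reached.
$found = array_keys($visited);
echo implode("\n", $found), "\n";
```

The sketch uses array keys with `isset()` as the visited set instead of `in_array()`, which avoids scanning the whole list on every link. The pages here link to each other in a cycle, which is exactly the case where marking a URL only after the recursion returns would never terminate.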
';` too, but I don't think that should make any difference. – Qirel Apr 08 '16 at 15:47