
I'm trying to create a small URL crawler for internal use within the company I work for.

Currently, I have a helper class where all the magic happens and an index.php that displays the results.

What I'd like is to give it a URL and have the code fetch every page URL the site contains and display them on screen.

However, waiting until this foreach loop finishes takes an age and as a result, I'd like to echo the link after each iteration of the loop.

I can't get it to work. I don't know if it's the link fetching code, or my attempts to flush the output buffer. I've followed the examples in this question here: Echo 'string' while every long loop iteration (flush() not working)

My code is below (without the flushing attempts)

// INDEX.PHP

require_once('helper.php');

$helper = new Helper();

flush();
ob_flush();

$found = $helper->crawlSite('http://www.bbc.co.uk', 'http://www.bbc.uk');

echo count($found);


// HELPER.PHP

class Helper
{
    private $checked = [];
    private $foundUrls = [];

    public function __construct()
    {

    }

    public function getHTML($url)
    {
        $curl = curl_init($url);

        curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
        $html = curl_exec($curl);
        curl_close($curl);

        return $html;
    }

    public function getTagFromHTML($html, $tag)
    {
        $dom = new DOMDocument();
        $dom->loadHTML($html);

        return $dom->getElementsByTagName($tag);
    }

    function crawlSite($url, $initialUrl)
    {
        $html = $this->getHTML($url);
        $links = $this->getTagFromHTML($html, 'a');

        foreach ($links as $link) {
            echo $link->getAttribute('href') . '<br>';

            flush();
            ob_flush();

            if (!in_array($link->getAttribute('href'), $this->checked)) {
                if (strpos($link->getAttribute('href'), $initialUrl) !== FALSE) {
                    $this->foundUrls[] = $link->getAttribute('href');
                    $this->crawlSite($link->getAttribute('href'), $initialUrl);
                } else {
                    $this->foundUrls[] = $initialUrl . $link->getAttribute('href');
                    $this->crawlSite($initialUrl . $link->getAttribute('href'), $initialUrl);
                }

                $this->checked[] = $link->getAttribute('href');
            }else{
                echo "Already Checked <br>";

                flush();
                ob_flush();
            }
        }


        return $this->foundUrls;
    }
}

Update

Updated the code to use a larger site to demonstrate the problem. I also included one of my attempts at flushing the output buffer and implemented @Dev Jyoti Behera's suggestion of moving the echo.

Update 2

Thanks to the suggestion mentioned above, I can now see live text being printed on the screen. However, I now have a second problem: the crawler seems to ignore the "already checked" if statement, so it checks and lists the same URL over and over. /sigh - I love programming, honestly.

Lewis
  • The code you updated with contains more opening brackets than closing ones. It should echo just fine once you fix the number of brackets. You can also place it directly after the `foreach ($links as $link) { echo $link->getAttribute('href') . '<br>';`, but I don't think that should make any difference.
    – Qirel Apr 08 '16 at 15:47
  • Does the `echo count($found);` line return a value besides 0? – WillardSolutions Apr 08 '16 at 15:48
  • Thanks Qirel. @EatPeanutButter On a small site, like the one mentioned in the question, it returns 4 (which is correct in this case). On a larger site, however, I can't tell as I just get a spinner. – Lewis Apr 08 '16 at 15:55
  • I see that $link->getAttribute('href') does not change inside the loop. Would it work for you to move the echo $link->getAttribute('href') line to the start of the loop body, before the if-else statement? That way, you will be able to see the link that is currently being crawled. As the code is written now, the link is printed only after all the crawling is over (which can take a very long time). – trans1st0r Apr 08 '16 at 15:55
  • @DevJyotiBehera I'll certainly try that - however, doesn't $link change on each loop being one of the loop parameters? – Lewis Apr 08 '16 at 15:57
  • @Lewis: It sure does. But, consider adding it to the first line of the foreach loop's body. This way, for every iteration and $link, a new value will be printed. – trans1st0r Apr 08 '16 at 15:59
  • @DevJyotiBehera Thanks for that, it has indeed solved the biggest part of my issue (it now echoes as expected). However, as per the updated question, more problems =( – Lewis Apr 08 '16 at 16:18
  • Cool! After the inner if-else statement, you do `$this->checked[] = $link->getAttribute('href');`. This saves all visited links so you can check whether a link that is about to be crawled is already in this array. However, you will need to move this line to the start of the outer if block. This makes sure that before another round of crawling begins inside the inner if-else, the link is already saved in the list and its presence can be checked in the recursive calls. – trans1st0r Apr 08 '16 at 17:09
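
The suggestion in the last comment (record a link as checked before recursing into it) can be sketched roughly as below. This is a minimal illustration, not a drop-in fix for the question's Helper class: `LinkCrawler`, the `$fetchLinks` callable and the `example.test` URLs are hypothetical names introduced here, and the callable stands in for the cURL and DOMDocument step so the sketch runs without network access. It also assumes the visited list is keyed on the same absolute URL that is passed to the recursive call.

```php
<?php
// Minimal sketch: record a URL as visited *before* recursing into it.
// LinkCrawler and $fetchLinks are hypothetical names, not from the question.
class LinkCrawler
{
    /** @var array visited URLs, stored as keys for fast lookup */
    private $checked = [];

    /** @var callable takes a URL, returns an array of absolute URLs on that page */
    private $fetchLinks;

    public function __construct(callable $fetchLinks)
    {
        $this->fetchLinks = $fetchLinks;
    }

    public function crawl($url)
    {
        if (isset($this->checked[$url])) {
            return; // already visited, stop here
        }

        // Mark first, so recursive calls that meet this URL again skip it.
        $this->checked[$url] = true;

        foreach (call_user_func($this->fetchLinks, $url) as $link) {
            $this->crawl($link);
        }
    }

    public function visited()
    {
        return array_keys($this->checked);
    }
}

// Stand-in for fetching and parsing: a tiny in-memory "site" with a link cycle.
$site = [
    'http://example.test/'  => ['http://example.test/a', 'http://example.test/b'],
    'http://example.test/a' => ['http://example.test/', 'http://example.test/b'],
    'http://example.test/b' => ['http://example.test/a'],
];

$crawler = new LinkCrawler(function ($url) use ($site) {
    return isset($site[$url]) ? $site[$url] : [];
});

$crawler->crawl('http://example.test/');
print_r($crawler->visited());
```

Because the visited check sits at the top of `crawl()` and the mark is made before the loop, the cycle between the three pages is walked once and each URL appears once in the result.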

1 Answer


Have you tried using ob_flush()? Here is an example. Maybe this helps: https://gist.github.com/jtallant/3260398
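
For reference, a minimal sketch of the usual per-iteration flushing pattern, assuming output buffering may or may not be active. `emitLine()` is a hypothetical helper name, and the short sleep only simulates slow work. Note the order: `ob_flush()` first, then `flush()`. A web server or proxy that buffers or compresses responses may still hold the output back, which this sketch cannot control.

```php
<?php
// Minimal sketch of pushing output to the client on each loop iteration.
// emitLine() is a hypothetical helper name.
function emitLine($text)
{
    echo $text, "<br>\n";

    // Flush PHP's own output buffer first (only if one is active),
    // then ask PHP to flush the system write buffer.
    if (ob_get_level() > 0) {
        ob_flush();
    }
    flush();
}

foreach (['first', 'second', 'third'] as $item) {
    emitLine($item);
    usleep(100000); // simulate 0.1 s of slow work per item
}
```

Calling `flush()` before `ob_flush()`, as in the question's code, can leave the newest line sitting in the system buffer until the next iteration.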

Klaus
  • Hi, thanks for the reply! Yep, I've tried that (was in the answers in the question I linked) – Lewis Apr 08 '16 at 15:45