
I grabbed a piece of script from here to crawl a website, put it up on my server, and it works. The only issue is that if I set the crawl depth to anything above 4 it doesn't work. I'm wondering whether it's due to the server's lack of resources or the code itself.

<?php

error_reporting(E_ALL); 

function crawl_page($url, $depth)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL;
    echo  "<br/>";
}
crawl_page("http://www.mangastream.com/", 2);
?>

EDIT:

I turned on error reporting for the script and all I get is this:

Error 324 (net::ERR_EMPTY_RESPONSE): Unknown error.

dbomb101
  • "I try and crawl set the crawler above level 4 *it doesn't work*" And that means? – Albireo Apr 11 '11 at 09:04
  • I meant to say if I set the depth variable above 4 it doesn't produce any results – dbomb101 Apr 11 '11 at 09:53
  • 1
    It would timeout anyway after the default time_limit is exceeded and setting this limit to a higher value is dangerous (several processes never ending and sucking memory can easily kill a server). You'd better use a real crawler like phpDig. – Capsule Apr 11 '11 at 12:53
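
As a rough illustration of the time limit mentioned in the last comment, here is a minimal sketch of checking and raising it. Whether ini_set() and set_time_limit() are honoured depends on the server configuration, and the values below are only examples, not recommendations:

<?php
// Sketch only: check the current limit, then raise it for a long crawl.
echo ini_get('max_execution_time'), PHP_EOL; // default is typically 30 seconds

set_time_limit(300);             // allow up to 5 minutes for this request
ini_set('memory_limit', '256M'); // deep recursion also grows memory use
?>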

1 Answer


Try making sure you have all error messages turned on (display_errors, error_reporting). That should give you more insight into why it's crashing.
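
For example, a minimal sketch of what that could look like at the top of the script (assuming the server lets you override these settings at runtime):

<?php
// Sketch: turn on full error output so fatal errors and timeouts are
// actually visible in the response instead of an empty page.
error_reporting(E_ALL);
ini_set('display_errors', '1');
?>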

Also, keep in mind that crawling is often illegal depending on what you're going to do with the data.

Evert