I'm talking about performing deep recursion for around 5+ minutes, the kind of thing a crawler might do in order to extract the URL links and sub-URL links of pages.

It seems that deep recursion in PHP is not realistic.

e.g.

getInfo("http://www.example.com");

function getInfo($link){
   // Assumes the Simple HTML DOM library: file_get_html() returns an object with find()
   $content = file_get_html($link);

   if($content && ($con = $content->find('.subCategories',0))){
      echo "go deeper<br>";
      getInfo($con->find('a',0)->href);
   }

   else{
      echo "reached deepest<br>";
   }
}
mk_89
  • It's no less realistic than with any other language. As long as you configure it not to observe execution time limits, and code your recursion with appropriate exits. – Michael Berkowski Jun 30 '12 at 22:02
  • It's perfectly realistic, but you'll probably want to keep a list of previously visited links to avoid infinite loops. – Ry- Jun 30 '12 at 22:02
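
The two comments above can be sketched together as follows. This is only an illustration: the `$pages` array is a made-up in-memory link graph standing in for real pages, and `crawl()` and `$maxDepth` are hypothetical names, not part of the question's code. `set_time_limit(0)` is the standard PHP call for lifting the execution time limit.

```php
<?php
// Hypothetical in-memory link graph standing in for real pages (assumption):
// each key is a URL, each value is the list of links found on that page.
$pages = array(
    'a' => array('b', 'c'),
    'b' => array('a', 'c'),   // links back to 'a', so the graph has a cycle
    'c' => array(),
);

// Lift the execution time limit, as a long-running crawl would need.
set_time_limit(0);

// Recursive crawl with two exits: a visited list and a maximum depth.
function crawl($url, $pages, &$visited, $depth = 0, $maxDepth = 50) {
    if (isset($visited[$url]) || $depth > $maxDepth) {
        return;   // already seen, or too deep: stop here
    }
    $visited[$url] = true;
    $links = isset($pages[$url]) ? $pages[$url] : array();
    foreach ($links as $link) {
        crawl($link, $pages, $visited, $depth + 1, $maxDepth);
    }
}

$visited = array();
crawl('a', $pages, $visited);
print_r(array_keys($visited));
```

Without the `$visited` check, the cycle between `a` and `b` would recurse until PHP ran out of memory or stack.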

1 Answer

Doing something like this with recursion is actually a bad idea in any language. You cannot know how deep the crawler will go, so it might lead to a stack overflow. Even if it does not, it still wastes a lot of memory on the huge stack, since PHP has no tail-call optimization (which would avoid keeping stack frames that are no longer needed).

Push the found URLs into a "to crawl" queue which is checked iteratively:

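// Assumes the Simple HTML DOM library: file_get_html() returns an object with find()
$queue = array('http://www.example.com');
$done = array();
while($queue) {
    $link = array_shift($queue);   // take the next URL to crawl
    $done[] = $link;               // remember it so it is not crawled twice
    $content = file_get_html($link);
    if($content && ($con = $content->find('.subCategories', 0))) {
        $a = $con->find('a', 0);
        $sublink = $a ? $a->href : null;
        if($sublink && !in_array($sublink, $done) && !in_array($sublink, $queue)) {
            $queue[] = $sublink;
        }
    }
}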
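
To try the queue idea without network access, here is a minimal self-contained sketch. It differs from the snippet above in two ways that are my assumptions, not the answer's: it follows every link on a page rather than only the first, and it tracks seen URLs as array keys so membership checks do not scan the whole list. `$pages`, `extractLinks()` and `crawlIteratively()` are hypothetical names; a real crawler would fetch and parse pages inside `extractLinks()`.

```php
<?php
// Hypothetical link graph standing in for fetched pages (assumption).
$pages = array(
    'http://www.example.com'   => array('http://www.example.com/a', 'http://www.example.com/b'),
    'http://www.example.com/a' => array('http://www.example.com/b', 'http://www.example.com'),
    'http://www.example.com/b' => array(),
);

// Hypothetical helper: returns the links found on a page.
// A real crawler would fetch and parse the page here instead.
function extractLinks($url, $pages) {
    return isset($pages[$url]) ? $pages[$url] : array();
}

function crawlIteratively($start, $pages) {
    $queue = array($start);
    $seen  = array($start => true);  // array keys act as a set of known URLs
    $order = array();                // URLs in the order they were processed
    while ($queue) {
        $url = array_shift($queue);
        $order[] = $url;
        foreach (extractLinks($url, $pages) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;  // mark on enqueue so it is never queued twice
                $queue[] = $link;
            }
        }
    }
    return $order;
}

$order = crawlIteratively('http://www.example.com', $pages);
echo implode("\n", $order), "\n";
```

The call stack stays flat no matter how many pages are linked; only the `$queue` and `$seen` arrays grow.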
ThiefMaster
  • You'll probably want to make a note about `[]` requiring PHP 5.4. Oh wait, I guess this comment works as that note too. – Ry- Jun 30 '12 at 22:04
  • Or I just replace it with `array()`. Too used to proper languages which have `[]` for ages... @Eric: Nope, `[]` for array literal is new in PHP 5.4. – ThiefMaster Jun 30 '12 at 22:05
  • @ThiefMaster: Congrats on your moderator title. Can you elaborate on why it is bad to do something like this in any language? I think it would benefit the answer. – bart s Jun 30 '12 at 22:06
  • I recently built a crawler for a large project, and used a similar approach, only the queue was in a database, and I ran the crawler on a cron job... – Mark Eirich Jun 30 '12 at 22:07
  • @ThiefMaster: Thanks for the elaboration – bart s Jun 30 '12 at 22:10
  • @bart: For one thing, you keep growing the stack for no good reason. Also, the queue approach makes it easier to save the program state and resume it later, and to deal with transient errors and duplicate links. – Ilmari Karonen Jun 30 '12 at 22:10
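
The point about saving state and resuming later could look roughly like the sketch below. It is an assumption-laden illustration, not anyone's actual crawler: the state lives in a JSON file rather than a database, `$pages` is a made-up link graph, and `saveState()`, `loadState()` and `runBatch()` are hypothetical helpers. Each call to `runBatch()` plays the role of one cron run.

```php
<?php
// Hypothetical helpers for saving and restoring crawl state (assumption).
function saveState($file, $queue, $done) {
    file_put_contents($file, json_encode(array('queue' => $queue, 'done' => $done)));
}

function loadState($file, $startUrl) {
    if (is_file($file)) {
        $state = json_decode(file_get_contents($file), true);
        if (is_array($state) && isset($state['queue'], $state['done'])) {
            return $state;
        }
    }
    // No usable saved state: start a fresh crawl.
    return array('queue' => array($startUrl), 'done' => array());
}

// Process at most $maxPages per run; a later run picks up where this one stopped.
function runBatch($file, $startUrl, $pages, $maxPages) {
    $state = loadState($file, $startUrl);
    $queue = $state['queue'];
    $done  = $state['done'];
    while ($queue && $maxPages-- > 0) {
        $url = array_shift($queue);
        $done[] = $url;
        $links = isset($pages[$url]) ? $pages[$url] : array();
        foreach ($links as $link) {
            if (!in_array($link, $done) && !in_array($link, $queue)) {
                $queue[] = $link;
            }
        }
    }
    saveState($file, $queue, $done);
    return $done;
}

// Made-up link graph standing in for real pages (assumption).
$pages = array('a' => array('b', 'c'), 'b' => array('c'), 'c' => array('a'));
$file  = tempnam(sys_get_temp_dir(), 'crawl');

$first  = runBatch($file, 'a', $pages, 2);  // first run: stops after two pages
$second = runBatch($file, 'a', $pages, 2);  // second run: resumes from the saved queue
print_r($second);
unlink($file);
```

Because the whole crawl state is just two arrays, it serializes trivially; a recursive crawler would have its state spread across the call stack.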