
I have a few existing sites whose URLs and content I need to load into Elasticsearch, so using some examples I built a way to grab all of the internal links on a site and put them in an array. The problem is that every example I've seen isn't fully recursive; there's a depth setting. I've spent the past 6 hours (seriously) trying different ways to make it fully recursive. Here's how I'm doing it now, but I think I'm infinite-looping and crashing, because after a minute of running I get no errors, just a "No Data Received" page. I'm open to any suggestions on a better approach.

    <?php
    set_time_limit(1209600);
    ini_set('memory_limit', '-1');

    $seen = array();
    $urls = crawl_page("http://example.com", $seen);

    foreach($urls as $url){
        echo $url.'<br />';
    }

    function crawl_page($url, $seen){

        //CURL TO GRAB PAGE
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $result = curl_exec($ch);
        curl_close($ch);


        //STRIP HTTP/S AND WWW FROM THE URL
        $urlStripped = str_replace('www', '', $url);
        $urlStripped = str_replace('http://', '', $urlStripped);
        $urlStripped = str_replace('https://', '', $urlStripped);

        //ADD THIS URL TO THE ARRAY
        if(!in_array($url, $seen)){
            $seen[] = $url;
        }


        //GET ALL LINKS IN PAGE
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER ); 
        foreach($matches as $match){
            $href = $match[1];
            //MAKE SURE THE LINK ISN'T A DUPLICATE AND IS INTERNAL
            if(!in_array($href, $seen) && is_in_string($urlStripped, $href)){
                $seen[] = $href;
            }
        }

        //HERE'S WHERE THE PROBLEM LIES, ATTEMPTING TO MAKE THIS RECURSIVE.
        //I'VE DONE THIS MANY DIFFERENT WAYS WITH NO LUCK.
        //I DON'T REALLY HAVE A REASON FOR ITS CURRENT STATE.
        //I ENDED UP TAKING SHOTS IN THE DARK, AND THAT'S WHAT LED ME TO ASK ON STACK OVERFLOW.
        $seenTemp1 = $seen;
        foreach($seenTemp1 as $aUrl){
            $seenTemp2 = crawl_page($aUrl, $seenTemp1);
            $seen = array_merge($seen, $seenTemp2);
        }

        //RETURN ARRAY
        return $seen;
    }

    function is_in_string($needle, $string){
        $before = strlen($string);
        $after = strlen(str_replace($needle, '', $string));

        if($before != $after){
            return true;
        }
        else{
            return false;
        }
    }

?>
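
For reference, a minimal sketch of a non-recursive (queue-based) version of the same idea: keep a queue of URLs still to fetch and a visited list, so the loop stops on its own once nothing new turns up and no depth setting is needed. The example.com start URL and the naive substring check for internal links are placeholders carried over from the code above, and this is untested:

    <?php
    // Rough sketch, untested: iterative crawl using a queue and a visited list
    // instead of recursion, so there is no depth setting and no re-crawling.

    function crawl_site($startUrl) {
        $host  = parse_url($startUrl, PHP_URL_HOST);   // e.g. "example.com"
        $queue = array($startUrl);                     // URLs still to fetch
        $seen  = array($startUrl => true);             // URLs already queued

        while (!empty($queue)) {
            $url = array_shift($queue);

            //CURL TO GRAB PAGE (same options as above)
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_URL, $url);
            curl_setopt($ch, CURLOPT_TIMEOUT, 60);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            $html = curl_exec($ch);
            curl_close($ch);

            if ($html === false) {
                continue; // skip pages that failed to load
            }

            //GET ALL LINKS IN PAGE, keep unseen internal ones
            preg_match_all('/<a[^>]+href=["\']([^"\']+)["\']/i', $html, $matches);
            foreach ($matches[1] as $href) {
                if (strpos($href, $host) !== false && !isset($seen[$href])) {
                    $seen[$href] = true;
                    $queue[]     = $href;
                }
            }
        }

        return array_keys($seen);
    }

    foreach (crawl_site("http://example.com") as $url) {
        echo $url . '<br />';
    }
    ?>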
user2430227
  • possible duplicate of [How to find all links / pages on a website](http://stackoverflow.com/questions/1439326/how-to-find-all-links-pages-on-a-website) – Jul 08 '14 at 21:18
  • Yeah, but how do I do this in PHP so that at the end I can loop through the final array and insert these URLs into a DB that's attached to my Elasticsearch? – user2430227 Jul 08 '14 at 21:22
  • You always want to have a depth setting, otherwise you could get stuck on a honey-trap website (they have an infinite depth). I'm not seeing any `sleep()` here - you should add that if you don't want to effect a denial of service on other people's web servers. As to the problem, what has come out of debugging? A depth of (say) 5 should do most websites, discounting duplicates as they come around. – halfer Jul 08 '14 at 21:40
  • The way in which `$seen` is copied and modified in the `foreach` loop looks a little convoluted. Trace this with some known test data and you might find a buglet here. – halfer Jul 08 '14 at 21:44
  • Alright, I'm going to try this with a depth setting, and I guess I should sleep(2) for every page while I'm at it. I'll post results (rough sketch below). – user2430227 Jul 08 '14 at 21:57
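
A rough, untested sketch of what the depth-capped version from the comments might look like (the depth of 5 and sleep(2) are just the values mentioned above; passing $seen by reference also sidesteps the copy-and-merge issue halfer pointed out):

    <?php
    // Rough sketch, untested: depth-limited recursion with a shared $seen array
    // (passed by reference) and a sleep between requests, per the comments.

    function crawl_page_limited($url, array &$seen, $depth = 0, $maxDepth = 5) {
        //STOP AT THE DEPTH CAP OR IF WE'VE ALREADY CRAWLED THIS URL
        if ($depth > $maxDepth || isset($seen[$url])) {
            return;
        }
        $seen[$url] = true;

        sleep(2); //BE POLITE TO THE REMOTE SERVER

        //CURL TO GRAB PAGE
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html === false) {
            return;
        }

        //RECURSE INTO INTERNAL LINKS ONLY
        $host = parse_url($url, PHP_URL_HOST);
        preg_match_all('/<a[^>]+href=["\']([^"\']+)["\']/i', $html, $matches);
        foreach ($matches[1] as $href) {
            if (strpos($href, $host) !== false) {
                crawl_page_limited($href, $seen, $depth + 1, $maxDepth);
            }
        }
    }

    $seen = array();
    crawl_page_limited("http://example.com", $seen);
    echo implode('<br />', array_keys($seen));
    ?>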

0 Answers