
I'd like to use PHP to crawl a document we have that has about 6 or 7 thousand href links in it. What we need is what is on the other side of each link, which means PHP would have to follow each link and grab its contents. Can this be done?

Thanks

6 Answers


Sure, just grab the content of your starting URL with a function like file_get_contents (http://nl.php.net/file_get_contents), find the URLs in the content of that page using a regular expression, grab the contents of those URLs, and so on.

Regexp will be something like:

$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
Sander
  • Thanks Sander. OK, so once I get the contents, say all of the URLs, I'd loop through each one but how do I tell PHP to follow the link? –  Sep 17 '09 at 08:24
  • Hey Sander, couldn't I use file_get_contents() for each link as well? –  Sep 17 '09 at 08:28
  • Yes, you can use file_get_contents() to get the content of the links within the pages. Basically you repeat the "get url content + extract links from it" process for each link you find. – Sander Sep 17 '09 at 09:25
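A minimal sketch of that loop, purely for illustration (the file name and variable names below are placeholders, not part of Sander's answer), could be:

<?php
// Hypothetical sketch: fetch the starting document, pull the URLs out of it,
// then fetch each of those URLs in turn with file_get_contents().
$regexUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";

$doc = file_get_contents('your_document.html');
preg_match_all($regexUrl, $doc, $matches);

foreach ($matches[0] as $link) {          // index 0 holds the full URL matches
    $content = file_get_contents($link);  // "following" the link is just another fetch
    // ... extract more links from $content, store it, etc. ...
}
?>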

You can try the following. See this thread for more details.

<?php
//set_time_limit (0);
function crawl_page($url, $depth = 5){
    // remember the URLs we have already visited so they are not crawled twice
    static $seen = array();
    if (($depth == 0) or (in_array($url, $seen))) {
        return;
    }
    $seen[] = $url;

    // fetch the page with cURL
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);

    if ($result) {
        // keep only the anchor tags, then pull out the href and link text
        $stripped_file = strip_tags($result, "<a>");
        preg_match_all("/<a[\s]+[^>]*?href[\s]?=[\s\"\']+"."(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/", $stripped_file, $matches, PREG_SET_ORDER);

        foreach ($matches as $match) {
            $href = $match[1];
            // resolve relative links against the current URL
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
    }
    echo "Crawled {$url}\n";
}
crawl_page("http://www.sitename.com/", 3);
?>
Team Webgalli

I just keep a SQL table of all the links I have found, and whether or not they have been parsed.

I then use Simple HTML DOM to parse the oldest added page, although as it tends to run out of memory with large pages (500kb+ of HTML) I use regex for some of it*. For every link I find, I add it to the SQL database as needing parsing, along with the time I found it.

The SQL database prevents the data from being lost on an error, and since I have 100,000+ links to parse, I do it over a long period of time.
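For illustration, such a table could look roughly like this (a sketch only, assuming PDO with an SQLite backend; the column names are mine, not the answerer's):

<?php
// Hypothetical schema and queue handling; adjust for your own database.
$db = new PDO('sqlite:crawler.db');
$db->exec("CREATE TABLE IF NOT EXISTS links (
    id       INTEGER PRIMARY KEY,
    url      TEXT UNIQUE,
    parsed   INTEGER DEFAULT 0,              -- 0 = still waiting to be fetched
    found_at DATETIME DEFAULT CURRENT_TIMESTAMP
)");

// Pull the oldest unparsed link, fetch/parse it, then mark it done.
$row = $db->query("SELECT id, url FROM links WHERE parsed = 0 ORDER BY found_at LIMIT 1")->fetch();
if ($row) {
    // ... fetch and parse $row['url'], INSERT any newly found links ...
    $db->prepare("UPDATE links SET parsed = 1 WHERE id = ?")->execute(array($row['id']));
}
?>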

I am unsure, but have you checked the user agent of file_get_contents()? If the pages aren't yours and you make thousands of requests, you may want to change the user agent, either by writing your own HTTP downloader or using one from a library (I use the one in the Zend Framework), though cURL etc. work fine. A custom user agent allows an admin looking over the logs to see information about your bot. (I tend to put the reason why I am crawling and a contact address in mine.)
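If you stick with file_get_contents(), the user agent can be set through a stream context; a minimal sketch (the UA string and URL are placeholders):

<?php
// Hypothetical example: identify the crawler so a site admin reading the logs
// can see what it is and how to reach you.
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'MyLinkChecker/1.0 (link audit; contact admin@example.com)',
        'timeout'    => 30,
    ),
));
$html = file_get_contents('http://www.example.com/', false, $context);
?>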

*The regex I use is:

'/<a[^>]+href="([^"]+)"[^"]*>/is'

A better solution (from Gumbo) could be:

'/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i'
Yacoby

Once you harvest the links, you can use cURL or file_get_contents (note that in a locked-down environment, file_get_contents may not be allowed to fetch over HTTP, though).

Eineki
  • Hi Eineki, what I have is the main html doc with 6K links. I figured I would parse those out first and then work on getting the data behind them. I don't have access to curl though. Will this be an issue? What are my options? –  Sep 17 '09 at 08:43
  • If _allow_url_fopen_ is enabled in _php.ini_, you can use any of the functions that take a filename as a parameter to open a URL (maybe you can't use include and require, I'm not sure about them). readfile, fopen and file_get_contents are your choices, and maybe there are others. If _allow_url_fopen_ is disabled, you may have to fall back to using sockets manually, but I wouldn't like to be in your shoes ;) – Eineki Sep 17 '09 at 09:44
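To check up front whether URL wrappers are available, something like this works (a small sketch; the fallback message is just a placeholder):

<?php
// If allow_url_fopen is enabled, file_get_contents() can read straight from a URL.
if (ini_get('allow_url_fopen')) {
    $html = file_get_contents('http://www.example.com/');
} else {
    // Otherwise you are down to cURL (if installed) or raw sockets via fsockopen().
    die('allow_url_fopen is disabled; enable it or use another HTTP client.');
}
?>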

The PHP Snoopy library has a bunch of built-in functions to accomplish exactly what you are looking for.

http://sourceforge.net/projects/snoopy/

You can download the page itself with Snoopy, and it has another function to extract all the URLs on that page. It will even correct the links to be full-fledged URIs (i.e. they aren't just relative to the domain/directory the page resides on).
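From memory, usage looks roughly like this (a sketch only; check the class file you download for the exact method and property names):

<?php
// Hypothetical Snoopy usage: fetchlinks() is supposed to grab a page and
// expose the extracted, absolutized links via $snoopy->results.
include 'Snoopy.class.php';

$snoopy = new Snoopy();
if ($snoopy->fetchlinks('http://www.example.com/')) {
    print_r($snoopy->results);
}
?>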

Nolte

I suggest that you take the HTML document with your 6,000 URLs, parse the URLs out, and loop through the list you've got. In the loop, get the contents of the current URL using file_get_contents (for this purpose you don't really need cURL as long as file_get_contents is enabled on your server), parse out the contained URLs again, and so on.

It would look something like this:

<?php
function getUrls($url) {
    $doc = file_get_contents($url);
    $pattern = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all($pattern, $doc, $matches);
    return $matches[0]; // index 0 holds the full URL matches, not the capture groups
}

$urls = getUrls("your_6k_file.html");
foreach ($urls as $url) {
    $moreUrls = getUrls($url);
    //do something with moreUrls
}
?>
Alex