I am a PHP newbie, and I suspect this will be hard to accomplish and heavy on the server, but I want to ask for the opinion of users much smarter than myself.

Here is what I am trying to do:

I have a list of URLs, an array of URLs actually.

For each URL, I want to count the outgoing links on that page that DO NOT HAVE the rel="nofollow" attribute.

So I'm afraid I'll have to make PHP load each page and use regular expressions (preg_match) to find all the links?

Would this work if I had, let's say, 1000 links?

Here is what I am thinking, putting it in code:

$homepage = file_get_contents('http://www.site.com/');

$homepage = htmlentities($homepage);

// Do a preg_match for http:// and count the number of appearances:
$urls = preg_match();

// Do a preg_match for rel="nofollow" and count the nr of appearances:
$nofollow = preg_match();


// Do a preg_match for the number of "domain.com" appearances so we can subtract the website's internal links:
$internal_links = preg_match();

// Subtract and get the final result:
$result = $urls - $nofollow - $internal_links;

Hope you can help, and if the idea is right maybe you can help me with the preg_match functions.

webmasters
  • Probably save yourself a lot of time if you used a DOM parser or PHPQuery instead of regexing. – Jared Farrish Jan 19 '13 at 03:29
  • [Don't use regexes for parsing HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – John Conde Jan 19 '13 at 03:30
  • I am a newb, and really new to this, I appreciate the tip. Maybe you can elaborate and if it works I can approve the response. – webmasters Jan 19 '13 at 03:31

2 Answers

You can use PHP's DOMDocument class to parse the HTML and parse_url to parse the URLs:

$url = 'http://stackoverflow.com/';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');

$numLinks = 0;
foreach ($links as $link) {

    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }

    // Exclude if internal link
    $href = $link->getAttribute('href');

    if (substr($href, 0, 2) === '//') {
        // Deal with protocol relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }

    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }

    // Increment counter otherwise
    echo 'URL: ' . $link->getAttribute('href') . "\n";
    $numLinks++;

}

echo "Count: $numLinks\n";
PleaseStand
  • Ty very much, I have decided to accept your reply since I believe it's better not to rely on extra scripts and libraries and because your answer is very complete, well written from start to finish. – webmasters Jan 21 '13 at 02:14
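
Since the question asks about running this over a whole array of URLs, one possible way to reuse the DOMDocument loop above is to wrap it in a function that takes the HTML as a string. This is only a sketch and not part of the original answer: the names `countExternalFollowedLinks` and `countForUrls` are hypothetical, and it assumes that "internal" means "same host as the page URL" and that pages which fail to download are simply skipped.

```php
<?php
// Hypothetical helper (not from the original answer): counts links in an HTML
// string that point to another host and do not carry rel="nofollow".
function countExternalFollowedLinks(string $html, string $pageUrl): int
{
    $pUrl = parse_url($pageUrl);
    if (!$pUrl || !isset($pUrl['host'])) {
        return 0;
    }
    $scheme = $pUrl['scheme'] ?? 'http';

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $loaded = $html !== '' && $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return 0;
    }

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $link) {
        // rel is a space-separated token list, so split it before checking
        $rel = preg_split('/\s+/', strtolower($link->getAttribute('rel')), -1, PREG_SPLIT_NO_EMPTY);
        if (!$link->hasAttribute('href') || in_array('nofollow', $rel, true)) {
            continue;
        }

        $href = $link->getAttribute('href');
        if (substr($href, 0, 2) === '//') {
            $href = $scheme . ':' . $href; // protocol-relative URL
        }

        $pHref = parse_url($href);
        if (!$pHref || !isset($pHref['host'])
            || strtolower($pHref['host']) === strtolower($pUrl['host'])) {
            continue; // relative, malformed, or same-host link
        }
        $count++;
    }
    return $count;
}

// Usage sketch: maps each URL to its count; failed downloads are skipped.
function countForUrls(array $urls): array
{
    $counts = [];
    foreach ($urls as $url) {
        $html = @file_get_contents($url);
        if ($html !== false) {
            $counts[$url] = countExternalFollowedLinks($html, $url);
        }
    }
    return $counts;
}
```

For around 1000 URLs the parsing itself is likely to be cheap compared with the downloads, so running this from the command line or a scheduled job rather than inside a web request is probably the safer choice.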

You can use SimpleHTMLDOM:

// Create DOM from URL or file
$html = file_get_html('http://www.site.com/');

// Find all links 
foreach($html->find('a[href][rel!=nofollow]') as $element) {
    echo $element->href . '<br>';
}

I'm not sure that SimpleHTMLDOM supports a :not selector, and [rel!=nofollow] might only return a tags that have a rel attribute (and not ones without it), so you may have to use:

foreach($html->find('a[href][!rel][rel!=nofollow]') as $element)

Note the added [!rel]. Or, do it manually instead of with a CSS attribute selector:

// Find all links 
foreach($html->find('a[href]') as $element) {
    if (strtolower($element->rel) != 'nofollow') {
        echo $element->href . '<br>';
    }
}
Jared Farrish
  • Ty so much for answering. I'm not sure why my q was down-voted, I really did not know using a regex is a bad idea, tried to find a solution myself... so ty again! – webmasters Jan 19 '13 at 03:55
  • One more thing, for internal links, I should try and get the $element's domain in the foreach loop (using your second method, css attr selector) – webmasters Jan 19 '13 at 03:57
  • @webmasters - Not sure what you mean (within an `href`?). – Jared Farrish Jan 19 '13 at 04:00
  • Here are the other attribute selectors with wildcard operators: http://simplehtmldom.sourceforge.net/manual.htm#section_access – Jared Farrish Jan 19 '13 at 04:02
  • I am reading the link you gave me. Yes, with this I identify the nofollow: strtolower($element->rel) == 'nofollow' - and I need a way to identify the internal links like: mydomain.com/link.html – webmasters Jan 19 '13 at 04:04
  • I don't understand. Have you looked at the docs? `$element->href` is available within the `foreach` loop, but are you looking for a specific domain only? An attribute selector would work in that case: `[href*=domain.com]`. `strpos` could be used as well: `strpos($element->href, 'domain.com') === false`. It just depends. – Jared Farrish Jan 19 '13 at 04:09
  • Note as well that I made a small error, it should be `strtolower($element->rel) != 'nofollow'`, not `==`. – Jared Farrish Jan 19 '13 at 04:09
  • Got it! Ty! You've been a great help ;) – webmasters Jan 19 '13 at 04:16
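
The two checks discussed in the comments above (nofollow and internal links) could be pulled out into small helpers that work on plain strings. This is a sketch under assumptions, not something from the thread: `hasNofollow` and `isInternalLink` are hypothetical names, and "internal" is assumed to mean an exact, case-insensitive host match or a link with no host at all. Comparing hosts with `parse_url` avoids the case where a substring test like `strpos` would also match a different domain that merely contains the same text.

```php
<?php
// Hypothetical helpers, not part of SimpleHTMLDOM or the original answer.

// True when the rel attribute value contains the token "nofollow",
// including multi-value cases such as rel="noopener nofollow".
function hasNofollow(?string $rel): bool
{
    $tokens = preg_split('/\s+/', strtolower((string) $rel), -1, PREG_SPLIT_NO_EMPTY);
    return in_array('nofollow', $tokens, true);
}

// True when $href points at $siteHost or has no host at all
// (relative links, and schemes such as mailto: that carry no host).
function isInternalLink(string $href, string $siteHost): bool
{
    $host = parse_url($href, PHP_URL_HOST);
    if ($host === null || $host === false) {
        return true;
    }
    return strtolower($host) === strtolower($siteHost);
}
```

Inside the `foreach` loop from the answer, the condition might then read `!hasNofollow((string) $element->rel) && !isInternalLink($element->href, 'mydomain.com')`. Note that an exact host match treats `www.mydomain.com` and `mydomain.com` as different sites, so that comparison may need loosening depending on the site.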