
Here is my code, which is partially based on a few different snippets you can easily find by googling. I'm trying to count the internal links, external links, all links, and (TO DO) the rel="nofollow" links on any webpage. This is what I have so far. Most of the results are correct, though some generic calls give me weird results, and I still need to handle rel="nofollow" and perhaps target="_blank" as well. If you care to comment on or add/change anything, with a bit of explanation of the logic, please do so; it would be much appreciated. (A rough sketch of the nofollow/_blank part is included after the code below.)

<?php

// transform a relative URL into an absolute one
function path_to_absolute($rel, $base)
{
    /* return if already absolute URL */
    if (parse_url($rel, PHP_URL_SCHEME) != '') return $rel;
    /* queries and anchors */
    if ($rel[0] == '#' || $rel[0] == '?') return $base.$rel;
    /* parse base URL and convert to local variables:
       $scheme, $host, $path */
    extract(parse_url($base));
    /* remove non-directory element from path */
    $path = preg_replace('#/[^/]*$#', '', $path);
    /* destroy path if relative url points to root */
    if ($rel[0] == '/') $path = '';
    /* dirty absolute URL */
    $abs = "$host$path/$rel";
    /* replace '//' or '/./' or '/foo/../' with '/' */
    $re = array('#(/\.?/)#', '#/(?!\.\.)[^/]+/\.\./#');
    for ($n = 1; $n > 0; $abs = preg_replace($re, '/', $abs, -1, $n)) {}
    /* absolute URL is ready! */
    return $scheme.'://'.$abs;
}


// initialise the counters
$intnumLinks = 0;
$extnumLinks = 0;
$nfnumLinks = 0;
$allnumLinks = 0;

// URL to analyse (note: unvalidated user input)
$url = $_REQUEST['url'];
// fetch the page contents
$html = file_get_contents($url);
// http://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php
// load the HTML into a DOM document (@ suppresses warnings about malformed markup)
$doc = new DOMDocument();
@$doc->loadHTML($html);

$xml = simplexml_import_dom($doc); // just to make the xpath simpler
$strings = $xml->xpath('//a');
foreach ($strings as $string) {

    // resolve the href against the page URL, then compare hosts
    $aa = path_to_absolute((string) $string['href'], $url);
    $a = parse_url($aa, PHP_URL_HOST);
    $a = str_replace("www.", "", $a);

    $b = parse_url($url, PHP_URL_HOST);

    if ($a == $b) {
        echo 'call-host: ' . $b . '<br>';
        echo 'type: int <br>';
        echo 'title: ' . $string[0] . '<br>';
        echo 'url: ' . $string['href'] . '<br>';
        echo 'host: ' . $a . '<br><br>';
        $intnumLinks++;
    } else {
        echo 'call-host: ' . $b . '<br>';
        echo 'type: ext <br>';
        echo 'title: ' . $string[0] . '<br>';
        echo 'url: ' . $string['href'] . '<br>';
        echo 'host: ' . $a . '<br><br>';
        $extnumLinks++;
    }
    $allnumLinks++;

}

// count results 
echo "<br>";
echo "Count int: $intnumLinks <br>";
echo "Count ext: $extnumLinks <br>";
echo "Count nf: $nfnumLinks <br>";
echo "Count all: $allnumLinks <br>";
?>
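
For the TO DO part, here is a rough, untested sketch of how the rel="nofollow" and target="_blank" counting might look, using the DOM API directly instead of SimpleXML. It assumes $doc is the DOMDocument already loaded above; $blanknumLinks is just a name I made up for the extra counter.

<?php
// sketch only: count rel="nofollow" and target="_blank" anchors
// assumes $doc is the DOMDocument that was loaded further up
$nfnumLinks = 0;
$blanknumLinks = 0;   // hypothetical counter for target="_blank" links

foreach ($doc->getElementsByTagName('a') as $anchor) {
    // rel may contain several space-separated tokens, e.g. "nofollow noopener"
    $relTokens = preg_split('/\s+/', strtolower($anchor->getAttribute('rel')));
    if (in_array('nofollow', $relTokens, true)) {
        $nfnumLinks++;
    }
    if (strtolower($anchor->getAttribute('target')) === '_blank') {
        $blanknumLinks++;
    }
}

echo "Count nf: $nfnumLinks <br>";
echo "Count _blank: $blanknumLinks <br>";
?>

The same checks could probably be folded into the existing foreach by switching it from SimpleXML to $doc->getElementsByTagName('a'), since getAttribute() gives easy access to href as well.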

Consider this post closed. At first I wanted to delete it, but then again someone might find this code useful for their own work.

  • I don't see a question here - maybe you're looking for a [code review](http://codereview.stackexchange.com/help/on-topic)? – HPierce Dec 14 '16 at 21:25
  • @HPierce, yes. I'll post it over there. I wasn't aware of that section. Apologies, and thanks for pointing it out. – pc_ Dec 14 '16 at 21:30
  • http://codereview.stackexchange.com/questions/149923/fetch-internal-and-external-links-count-from-a-webpage-with-php – I just posted it in the Code Review section. – pc_ Dec 14 '16 at 22:02
