I have been trying to get this script to work for days. I think I'm almost there, but it isn't quite working. My understanding of PHP is limited and I have hacked this script together from a lot of other scripts.

I have separated the script into two parts. Each part works when run on its own, but when they are placed together the page never loads.

The first part checks a URL for any external links, ignoring any nofollow links.

The second part checks the server response header for a URL.

The whole script should find any external, followed links on a web page and then check whether any of those links are broken.

Any help to get this working would be very much appreciated.

<?php

// This is the first part of the script to get a list of external links from a web page ignoring any nofollow links

// Set the parent URL
$url = 'http://www.example.com';
$pUrl = parse_url($url);

// Load the HTML into a DOMDocument
$doc = new DOMDocument;
@$doc->loadHTMLFile($url);

// Look for all the 'a' elements
$links = $doc->getElementsByTagName('a');


$numLinks = 0;
foreach ($links as $link) {

    // Exclude if not a link or has 'nofollow'
    preg_match_all('/\S+/', strtolower($link->getAttribute('rel')), $rel);
    if (!$link->hasAttribute('href') || in_array('nofollow', $rel[0])) {
        continue;
    }

    // Exclude if internal link
    $href = $link->getAttribute('href');

    if (substr($href, 0, 2) === '//') {
        // Deal with protocol relative URLs as found on Wikipedia
        $href = $pUrl['scheme'] . ':' . $href;
    }

    $pHref = @parse_url($href);
    if (!$pHref || !isset($pHref['host']) ||
        strtolower($pHref['host']) === strtolower($pUrl['host'])
    ) {
        continue;
    }

    // Increment counter otherwise
    echo $link->getAttribute('href') . " - ";
    $numLinks++;

    // This is the second part of the script to check to see if the link returns no response or a 404 response.

    // Reset $checkurl
    $checkurl = '';

    // Set the URL to check server response code
    $checkurl = $link->getAttribute('href');

    // Check header response for URL
    file_get_contents($checkurl);
    $response = $http_response_header[0];

    // If 404 exists in response then set as 404
    if (strpos($response,'404') !== false) {
        $server_response = '404';
    }

    // If there is no response then set as 404
    if ($response == '') {
        $server_response = '404';
    }

    echo $server_response;
    echo '<br>';

}

?>
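
One possible cause of the hang, offered as a guess rather than a confirmed diagnosis: `file_get_contents($checkurl)` downloads the full body of every external link, one after another, and each request can wait up to PHP's `default_socket_timeout` (60 seconds by default) before giving up. The sketch below sends a `HEAD` request with a short timeout instead and reads the status line from the stream metadata. It assumes PHP 7.1+ with `allow_url_fopen` enabled; `statusCodeFromHeaders()` and `checkLink()` are hypothetical helper names, and the 5-second timeout is an arbitrary choice. Some servers reject `HEAD` requests, so a fallback to `GET` may be needed.

```php
<?php
// Sketch only: statusCodeFromHeaders() and checkLink() are hypothetical helpers,
// not part of the original script. Assumes PHP 7.1+ and allow_url_fopen = On.

/**
 * Return the status code from the last "HTTP/..." status line in a list of
 * header lines, or null if there is none. After redirects, the last status
 * line belongs to the final response.
 */
function statusCodeFromHeaders(array $headerLines): ?int
{
    $code = null;
    foreach ($headerLines as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

/**
 * Request only the headers of $url and return the final status code,
 * or null if no response was received (DNS failure, timeout, refused).
 */
function checkLink(string $url, float $timeoutSeconds = 5.0): ?int
{
    $context = stream_context_create([
        'http' => [
            'method'        => 'HEAD',           // headers only, no body download
            'timeout'       => $timeoutSeconds,  // assumed value; tune as needed
            'ignore_errors' => true,             // still return headers on 4xx/5xx
        ],
    ]);

    $stream = @fopen($url, 'r', false, $context);
    if ($stream === false) {
        return null;
    }

    $meta = stream_get_meta_data($stream);
    fclose($stream);

    // For the http wrapper, wrapper_data holds the raw response header lines.
    return statusCodeFromHeaders($meta['wrapper_data'] ?? []);
}
```

Inside the `foreach` loop, `$code = checkLink($href);` could stand in for the `file_get_contents()` call, treating `null` or `404` as broken. Passing `$href` rather than the raw attribute keeps the protocol-relative fix, and because `$code` is assigned fresh for every link, a `404` from one link cannot carry over to the next the way `$server_response` can when it is never reset.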
Ade Lewis
  • `"My understanding of PHP is limited and I have hacked this script together from a lot of other scripts."` *This* is your real problem. Take the time right now to go through and actually *understand* the code instead of hacking pieces together. We're not here to get your piecemeal code working for you, sorry. – Jonathon Reinhart Nov 19 '13 at 08:39
  • http://stackoverflow.com/questions/15770903/check-if-links-are-broken-in-php – Balaji Kandasamy Nov 19 '13 at 08:45
  • As stated, I have spent days trying to understand the code. There is a lot going on here for someone relatively new to PHP, and part of the learning process is trial, error and asking people for help. I have learnt a lot from this process and gone through a lot of trial and error. I am at the point of needing help. If someone can point out where I am going wrong, then I will have learned something new and won't need that help again. If your answer is that I should go out and learn everything there is to know about PHP before trying to achieve anything, that doesn't help. – Ade Lewis Nov 19 '13 at 08:50
  • Hi balajimca. Thanks for your link. I did try this option, but unfortunately I don't have cURL installed. The second part of the script does work on its own to get the server response, but I just can't figure out why the script hangs when both parts are combined. Do you think that switching hosting and using the cURL option would cure this? Thanks again. – Ade Lewis Nov 19 '13 at 08:55

0 Answers