
I have a DB with URLs of manufacturers collected over the last few years, and I need to do some spring cleaning:

  1. Some URLs look like http://brandname.com/aboutus/, so I need to remove any path and keep just the main domain, because many of those paths/subdirectories may have expired...

  2. I would love to be able to check whether those domains still exist or have been taken over by domain sharks...

I'm currently using PHP+MySQL

WillardSolutions
Francesco
  • Well, what is your question here? Obviously you need to take the URLs one by one, use `parse_url()` to pick the tokens you need, so scheme and hostname here, and then make a test request. I'd even say that you are not at all interested in the domains, but in the hostnames, since a domain without web service most likely is without value to you... – arkascha Oct 12 '16 at 16:14
  • Use regular expression http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url – Suraj Oct 12 '16 at 16:21
  • @arkascha thanks for pointing me to parse_url! – Francesco Oct 12 '16 at 16:22
  • @Suraj Regular expressions are only a last means if no better matching function exists... – arkascha Oct 12 '16 at 16:23

1 Answer


Below is a function for doing what you ask, with references to Stack Overflow answers which give the details you need.

First:
Sanitise and validate the URL using PHP's standard filter_var() function with FILTER_SANITIZE_URL and FILTER_VALIDATE_URL. You may also need to ensure that the scheme (http:// or https://) is present.

Second:
Run a PHP cURL request to get the HTTP status code of the full URL and, if that fails, of the site URL. Source.

$url = 'http://www.example.com/folder/file.php';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo 'HTTP code: ' . $httpcode;

Third:
If $httpcode is 200, the link works. Otherwise, cut the link down to just the site and recheck whether the site (still) exists. You can do this using parse_url(). Source.

So:
if($httpcode == 200){
    //works
}
elseif($httpcode >= 400 ){
     /*** errors 400+ ***/
    $siteUrlParts = parse_url($url);
    // scheme and host are joined with "://", e.g. http://www.example.com
    $siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
}
else {
   //some other header, up to you how you want to handle this.
   // could be a redirect 302 or something...  
}

Note that the scheme part is important, not just the host part.
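The parse_url() step above can be sketched as a small standalone helper. This is only a minimal sketch: the function name site_root() and the choice of http as the fallback scheme are assumptions for illustration, not part of the original code. It also keeps an explicit port if one is present.

```php
<?php
// Hypothetical helper: reduce a URL to scheme://host[:port].
// Returns null when no host can be extracted.
function site_root(string $url, string $defaultScheme = 'http'): ?string
{
    $url = trim($url);

    // Without a scheme, parse_url() treats "brandname.com/aboutus/" as a
    // path and reports no host, so prepend the assumed default scheme.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $url)) {
        $url = $defaultScheme . '://' . $url;
    }

    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    // Scheme and host are case-insensitive, so lowercasing them is safe.
    $scheme = strtolower($parts['scheme'] ?? $defaultScheme);
    $root   = $scheme . '://' . strtolower($parts['host']);

    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root;
}
```

For example, site_root('http://brandname.com/aboutus/') gives 'http://brandname.com', and a row stored without a scheme, such as 'brandname.com/aboutus/', gives the same result.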

Fourth:
That's it, update the database row with the new working URL.

All Together:

function get_header_code($url){
    /*** 
     cURL
     ***/
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
    curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_TIMEOUT,10);
    $output = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpCode;
}

function clean_url($link){
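    // Trim only: lowercasing the whole URL would also change the path,
    // which is case-sensitive on many servers.
    $link = trim($link);
    $link = filter_var($link, FILTER_SANITIZE_URL);

    if(stripos($link, "https://") !== 0 && stripos($link, "http://") !== 0){
        $link = "http://".$link;
    }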

    if(filter_var($link, FILTER_VALIDATE_URL) === FALSE){
        /***
         Invalid URL so clean and remove.
         ***/
        return false;
    }
    $httpCode = get_header_code($link);

    if($httpCode == 200){
      /***
       works, so return full URL
       ***/
      return $link;
    }
    if($httpCode >= 400 ){
     /*** errors 400+ ***/
        $siteUrlParts = parse_url($link);
        $siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
        if(get_header_code($siteUrl) == 200){
             /***
              Obviously you can add conditionals to accept if it is a
              redirection but this is a basic example
              ***/  
             return $siteUrl;
        }
        return false;
    }
    else {
       /***
        some other header, up to you how you want to handle this.
        could be a redirect 301, 302 or something... 
        ***/
       return false; 
    }

}

And run it as:

/***
 returns either false or the URL of a working domain from the Db.
 ***/
$updateValueUrl = clean_url($databaseRow['url']);
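As written, clean_url() returns false for redirects. If you would rather follow them and see where the request ends up, something like the sketch below may help. The function names and the status labels ('ok', 'moved', 'error', 'unreachable') are made up for illustration. It treats example.com and www.example.com as the same site, which is an assumption you may not want. Note that it only flags a redirect to a different host: a parked domain that answers 200 on its own hostname will still look 'ok'.

```php
<?php
// Hypothetical helper: lowercase host without a leading "www.".
function bare_host(string $url): ?string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);
    return strpos($host, 'www.') === 0 ? substr($host, 4) : $host;
}

// Hypothetical classifier: pure function, so it can be tested offline.
function classify_result(string $requestedUrl, int $httpCode, string $effectiveUrl): string
{
    if ($httpCode === 0) {
        return 'unreachable';   // no HTTP response at all (DNS failure, timeout...)
    }
    if ($httpCode >= 400) {
        return 'error';
    }
    if (bare_host($requestedUrl) !== bare_host($effectiveUrl)) {
        return 'moved';         // redirected to a different host
    }
    return 'ok';
}

// Follows up to 5 redirects and reports the final status code and URL.
function check_with_redirects(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request, no body
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    curl_exec($ch);
    $code  = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $final = (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    return [
        'status'    => classify_result($url, $code, $final),
        'code'      => $code,
        'final_url' => $final,
    ];
}
```

Rows that come back as 'moved' are probably best reviewed by hand rather than updated automatically, since the new host may or may not belong to the same manufacturer.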

This is probably not perfect for your case, but it should give you a good base from which to build the behaviour you want. Once this is in place, you can run a PHP/MySQL loop that grabs the URLs in LIMIT batches (of maybe 500 or 1000 at a time), loops through each batch with foreach, and updates each row with the output of these functions.
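A rough sketch of that batch loop, using PDO, could look like the following. The table name manufacturers and the columns id and url are assumptions, so adjust them to your schema. The cleaning function is passed in as a callable, so you can hand it clean_url or anything else that returns either a URL or false. Rows for which the cleaner returns false are left untouched here. Whether to delete or flag them is up to you.

```php
<?php
// Assumed (hypothetical) schema: manufacturers(id INT PRIMARY KEY, url VARCHAR).
// Walks the table in batches ordered by id and returns the number of rows updated.
function clean_all_urls(PDO $pdo, callable $cleaner, int $batchSize = 500): int
{
    $batchSize = max(1, $batchSize);

    // $batchSize is an int, so appending it to the SQL string is safe.
    $select = $pdo->prepare(
        'SELECT id, url FROM manufacturers WHERE id > :last ORDER BY id LIMIT ' . $batchSize
    );
    $update = $pdo->prepare('UPDATE manufacturers SET url = :url WHERE id = :id');

    $lastId  = 0;
    $changed = 0;

    do {
        // Continue after the last id seen, rather than using OFFSET.
        $select->bindValue(':last', $lastId, PDO::PARAM_INT);
        $select->execute();
        $rows = $select->fetchAll(PDO::FETCH_ASSOC);

        foreach ($rows as $row) {
            $lastId = (int) $row['id'];
            $clean  = $cleaner($row['url']);

            // false means "no working URL found": the row is not modified.
            if ($clean !== false && $clean !== $row['url']) {
                $update->execute([':url' => $clean, ':id' => $lastId]);
                $changed++;
            }
        }
    } while (count($rows) === $batchSize);

    return $changed;
}
```

With the functions from this answer loaded, the call would be something like clean_all_urls($pdo, 'clean_url', 500). Each URL can take up to the cURL timeout to check, so a full run over a large table may take a long time. Running it from the command line rather than through a web request is probably the safer option.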

Community
Martin