Below is a function for doing what you ask, with references to Stack Overflow answers which give the details you need.
First:
Parse the URL using PHP's standard filter_var validate (and sanitise) filters. You may also need to ensure that the scheme is properly defined.
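As a rough sketch of this first step in isolation: the helper name normalise_url() is made up for illustration, and defaulting to http:// when no scheme is present is an assumption you may want to change.

```php
<?php
// Hypothetical helper (not from the linked answers): sanitise, ensure a scheme, validate.
function normalise_url(string $raw)
{
    // Strip characters that are not legal in a URL.
    $url = filter_var(trim($raw), FILTER_SANITIZE_URL);

    // Assumption: a URL stored without a scheme is treated as plain http.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }

    // FILTER_VALIDATE_URL returns false when the string is not a valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    return $url;
}
```

For example, normalise_url('www.example.com/folder/file.php') should give back the same address with http:// in front, while an empty string should come back as false.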
Second,
Run a PHP cURL request to get the HTTP header of the full URL and then of the site URL. Source.
$url = 'http://www.example.com/folder/file.php';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true); // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true); // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo 'HTTP code: ' . $httpcode;
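If you would rather have cURL follow redirects itself than handle 301/302 as a separate case, something along these lines may work. It is only a sketch: get_final_status() is a hypothetical name, and the redirect limit and timeouts are arbitrary choices.

```php
<?php
// Hypothetical helper: HEAD-style request that follows redirects and reports where it ended up.
function get_final_status(string $url, int $maxRedirects = 5): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);            // we don't need the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // don't echo the response
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);    // follow 3xx Location headers
    curl_setopt($ch, CURLOPT_MAXREDIRS, $maxRedirects);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);

    // The code is 0 when the request failed outright (DNS error, timeout, ...).
    // The cURL handle is released when $ch goes out of scope at return.
    return [
        'code' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'url'  => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    ];
}
```

The 'url' entry is the last URL cURL requested, so if the final code is 200 you could store that address instead of the original one.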
Third
If $httpcode is 200 then it's a good working link; otherwise we need to cut the link down to just the site and recheck whether the site (still) exists. You can do this using parse_url. Source.
so:
if($httpcode == 200){
//works
}
elseif($httpcode >= 400 ){
/*** errors 400+ ***/
$siteUrlParts = parse_url($url);
// "://" (not "//") so that scheme and host join into a valid URL
$siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
}
else {
//some other header, up to you how you want to handle this.
// could be a redirect 302 or something...
}
Note that the scheme part is important, not just the host part.
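To make the scheme-plus-host point concrete, here is a minimal sketch of the trimming step on its own. The function name site_root() is an assumption for illustration; it returns false when parse_url cannot find both parts, which is what happens for a URL stored without a scheme.

```php
<?php
// Hypothetical helper: reduce a full URL to "scheme://host".
function site_root(string $url)
{
    $parts = parse_url($url);

    // Without a scheme, parse_url() puts everything into 'path' and there is no 'host'.
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return false;
    }
    return $parts['scheme'] . '://' . $parts['host'];
}
```

With the example URL from above, site_root('http://www.example.com/folder/file.php') yields 'http://www.example.com', whereas site_root('www.example.com/folder/file.php') yields false.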
Fourth
That's it, update the database row with the new working URL.
All Together:
function get_header_code($url){
/***
cURL
***/
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true); // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true); // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return $httpCode;
}
function clean_url($link){
// Caution: this lowercases the path too, which can break URLs on case-sensitive servers.
$link = strtolower($link);
$link = filter_var($link, FILTER_SANITIZE_URL);
if(substr($link,0,8) !== "https://" && substr($link,0,7) !== "http://"){
$link = "http://".$link;
}
if(filter_var($link, FILTER_VALIDATE_URL) === FALSE){
/***
Invalid URL so clean and remove.
***/
return false;
}
$httpCode = get_header_code($link);
if($httpCode == 200){
/***
works, so return full URL
***/
return $link;
}
if($httpCode >= 400 ){
/*** errors 400+ ***/
$siteUrlParts = parse_url($link);
$siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
if(get_header_code($siteUrl) == 200){
/***
Obviously you can add conditionals to accept if it is a
redirection but this is a basic example
***/
return $siteUrl;
}
return false;
}
else {
/***
some other header, up to you how you want to handle this.
could be a redirect 301, 302 or something...
***/
return false;
}
}
And run it as:
/***
returns either false or the URL of a working domain from the Db.
***/
$updateValueUrl = clean_url($databaseRow['url']);
This is probably not quite perfect for you, but it should give you a good grounding from which to build your desired behaviour. Once this is in place you can run a PHP MySQL loop to grab the URLs in LIMIT batches (of maybe 500 or 1000 at a time), loop through each one using foreach, and update each row with the output from these functions.
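The batch loop could look roughly like the sketch below. Everything database-specific here is an assumption: a PDO connection, a table called links with an integer id and a url column, and the choice to leave a row untouched when the checker returns false (you may prefer to delete or flag those rows). The checker is passed in as a callable so that clean_url() can be swapped for something else.

```php
<?php
// Hypothetical batch updater: table "links" and columns "id"/"url" are assumed names.
function update_urls_in_batches(PDO $pdo, callable $check, int $batchSize = 500): int
{
    // Page by id rather than OFFSET so each batch starts where the last one ended.
    $select = $pdo->prepare('SELECT id, url FROM links WHERE id > :lastId ORDER BY id LIMIT :batch');
    $update = $pdo->prepare('UPDATE links SET url = :url WHERE id = :id');

    $lastId = 0;
    $updated = 0;
    do {
        $select->bindValue(':lastId', $lastId, PDO::PARAM_INT);
        $select->bindValue(':batch', $batchSize, PDO::PARAM_INT);
        $select->execute();
        $rows = $select->fetchAll(PDO::FETCH_ASSOC);

        foreach ($rows as $row) {
            $lastId = (int) $row['id'];
            $result = $check($row['url']);

            // Assumption: false means "no working URL found", so the row is left as-is.
            if ($result !== false && $result !== $row['url']) {
                $update->execute([':url' => $result, ':id' => $lastId]);
                $updated++;
            }
        }
    } while (count($rows) === $batchSize); // a short batch means we reached the end

    return $updated;
}
```

With the functions above defined, a call such as update_urls_in_batches($pdo, 'clean_url', 500) would walk the whole table and return the number of rows it changed.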