I'm a little out of my depth here, but I believe I'm now on the right track. I want to take user-supplied URLs and store them in a database so that the links can be shown on a user profile page.

The links I'm hoping users will supply are for social media sites, Facebook and the like. While looking for a way to store user-supplied URLs safely I found this page: http://electrokami.com/coding/use-php-to-format-and-validate-a-url-with-these-easy-functions/. The code works but seems to strip out nearly everything. If I pass in "www.example.com/user.php?u=borris" it just returns "example.com is valid".

Then I found out about regular expressions and came across this pattern

/(?:https?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/

from https://gist.github.com/marcgg/733592, along with another Stack Overflow post, "Check if a string contains a url and get contents of url php".

I tried to merge the two so that I'd get something that validates a link to a Facebook profile or page. I don't want to fetch profile info, pictures, etc. My code isn't right though, so rather than getting deeper into things I don't fully understand yet, I thought asking for help was best.

Below is the code I mashed together, which gives me the error "Warning: preg_match_all() [function.preg-match-all]: Compilation failed: unmatched parentheses at offset 29... on line 9"

<?php
// get url to check from the page parameter 'url'
// or fall back to a default test URL
$text = isset($_GET['url']) 
? $_GET['url'] 
: "http://www.vwrx-project.co.uk/user.php?u=borris";

$reg_exurl = "/(?:http|https|ftp|ftps)?:\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/";
preg_match_all($reg_exurl, $text, $matches);
$usedPatterns = array();
$url = '';
foreach($matches[0] as $pattern){
    if(!array_key_exists($pattern, $usedPatterns)){
        $usedPatterns[$pattern] = true;
        $url = $pattern;
    }
}

?>
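
For reference, the "unmatched parentheses" warning appears to come from the scheme part of the pattern: `(?:http|https|ftp|ftps)?:\/\/)?` contains a closing `)` with no matching opening one. Below is a minimal sketch of a balanced version, assuming the intent was "an optional scheme followed by `://`, then an optional `www.`". The helper name `extract_facebook_links` is hypothetical and only wraps the same `preg_match_all` call with de-duplication.

```php
<?php
// Sketch only: the scheme alternatives and "://" sit inside ONE optional group,
// so every ")" now has a matching "(".
$reg_exurl = '/(?:(?:https?|ftps?):\/\/)?(?:www\.)?facebook\.com\/(?:(?:\w)*#!\/)?(?:pages\/)?(?:[\w\-]*\/)*([\w\-\.]*)/';

// Hypothetical helper: returns each distinct full match found in $text.
function extract_facebook_links(string $text, string $pattern): array
{
    preg_match_all($pattern, $text, $matches);
    // $matches[0] holds the full matches; drop duplicates and reindex.
    return array_values(array_unique($matches[0]));
}
```

With this pattern a string that does not contain `facebook.com/` yields an empty array, so the default test URL in the code above would produce no match.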

--------------------------------------------------------- Additional ------------------------------------------------------------

I took a fresh look today at the answer Dave provided and felt I could work with it. It makes more sense to me from a code perspective, as I can follow the process.

I now have a system I'm partly happy with. If I supply the link http://www.facebook.com/#!/lilbugga, which is a typical Facebook link (what you get when clicking your username or profile picture from your wall), I get the result http://www.facebook.com/lilbugga, which shows as valid.

What it can't handle is a Facebook link that isn't in the vanity/SEO-friendly format, such as https://www.facebook.com/profile.php?id=4. If I allow my code to accept ? and =, I suspect I'm leaving my website/database open to attack, which I don't want.

What's the best option now? This is the code I have:

<?php   
$dirty_url = "http://www.facebook.com/profile.php?id=4";  //user supplied link

// clean the url, keeping only alphanumerics and : / . (this strips the #! from facebook's /#!/ link format)
$clean_url = preg_replace('#[^a-z0-9:/.]#i', '', $dirty_url); 

$parsed_url = parse_url($clean_url); // parse url to get a breakdown of its components

$safe_host = $parsed_url['host']; // safe host direct from parse_url

// str_replace to switch any // to a / inside the returned path - required due to preg_replace process above
echo $safe_path = str_replace("//", "/", ($parsed_url['path']));

if ($parsed_url['host'] == 'www.facebook.com') {
  echo "<a href=\"http://$safe_host$safe_path\" alt=\"facebook\" target=\"_new\">Facebook</a>";
} else {
    echo " :( invalid url";
}
?>
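
One possible way around the `?` and `=` concern, sketched under assumptions: rather than stripping characters from the whole URL, parse it first and validate each component separately. The `profile.php` form is accepted only when `id` is purely numeric, and the URL is rebuilt from the validated pieces. The helper name `normalise_facebook_url`, the two accepted hosts, and the choice to always emit `https://www.facebook.com` are assumptions for illustration. Storing the result through a prepared statement and passing it through `htmlspecialchars()` on output would still be needed; this sketch covers only the validation step.

```php
<?php
// Hypothetical helper: returns a normalised Facebook URL, or null if the input is rejected.
function normalise_facebook_url(string $dirty_url): ?string
{
    $url = trim($dirty_url);
    // Users may omit the scheme; parse_url() needs one to identify the host.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return null;
    }
    // Assumption: only these two hosts count as Facebook.
    $host = strtolower($parts['host']);
    if ($host !== 'facebook.com' && $host !== 'www.facebook.com') {
        return null;
    }
    $path = $parts['path'] ?? '/';
    // Links of the /#!/name form carry the real path in the fragment.
    if (isset($parts['fragment']) && strpos($parts['fragment'], '!/') === 0) {
        $path = substr($parts['fragment'], 1);
    }
    // profile.php?id=123: accept only a purely numeric id, ignore other parameters.
    if ($path === '/profile.php') {
        parse_str($parts['query'] ?? '', $query);
        if (isset($query['id']) && is_string($query['id']) && preg_match('/^[0-9]+$/D', $query['id'])) {
            return 'https://www.facebook.com/profile.php?id=' . $query['id'];
        }
        return null;
    }
    // Vanity names and pages: letters, digits, dot, dash and slash only.
    if (preg_match('#^/[A-Za-z0-9.\-/]+$#D', $path)) {
        return 'https://www.facebook.com' . $path;
    }
    return null;
}
```

Because the returned string is assembled only from a fixed prefix plus characters that passed a whitelist, nothing outside that whitelist from the user's input reaches the stored value.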
lil_bugga
  • what is expected output in `http://www.vwrx-project.co.uk/user.php?u=borris`? – Braj Jul 12 '14 at 19:18
  • Could you provide a few links that should or shouldn't be valid? – hex494D49 Jul 12 '14 at 19:20
  • I was hoping http://www.vwrx-project.co.uk/user.php?u=borris would come back as invalid. It's actually a link to the site I'm building and hoping to use this code on, pointing at a profile page that does not exist. I would like facebook.com/username, facebook.com/123456789 and facebook.com/thisIsMyPage to be allowed, basically any genuine Facebook link. I don't expect all my users to enter the http:// part, and some may even leave out the www. The main thing I want is to check that the link they supply points to Facebook, and that I can store it safely in a database – lil_bugga Jul 12 '14 at 19:31
  • Why not just something like `^\w*:\/\/(facebook|fbcdn|whatever)\/`? You could also download whatever page the user supplied, check whether your parsers can understand it (be it a Facebook profile page or anything else) and, if not, show the user an error saying the URL is invalid. A regex cannot, for example, tell whether the given user name exists on the external site; it only checks that the link looks legitimate, which is insufficient anyway. – rr- Jul 12 '14 at 19:33
  • If all else fails, you can have your moderators confirm every change to these URLs, which is the safest but also the most expensive way of validating links to third-party services. I've actually seen this solution in the wild, so at least some people believe it's good. – rr- Jul 12 '14 at 19:42
  • TL;DR — [FILTER_VALIDATE_URL](http://es1.php.net/manual/en/filter.filters.validate.php) – Álvaro González Jul 13 '14 at 07:43

2 Answers

I'm not sure exactly what you are trying to accomplish, but it sounds like you could use parse_url for this:

<?php
   $parsed_url = parse_url($_GET['url']);
   //assume it's "http://www.vwrx-project.co.uk/user.php?u=borris"
   print_r($parsed_url);
   /*
     Array
     (
         [scheme] => http
         [host] => www.vwrx-project.co.uk
         [path] => /user.php
         [query] => u=borris
     )
   */
   if ($parsed_url['host'] == 'www.facebook.com') {
      //do stuff
   }
?>
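
The host check above can be combined with PHP's built-in `FILTER_VALIDATE_URL` filter so that malformed input is rejected before the host is inspected. A minimal sketch follows; the helper name `is_facebook_url` and the list of accepted hosts are assumptions for illustration, not part of the original answer.

```php
<?php
// Hypothetical helper; the allowed-host list is an assumption for illustration.
function is_facebook_url(string $url): bool
{
    // FILTER_VALIDATE_URL requires a scheme, so "www.facebook.com/x" on its own fails here.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Ask parse_url() for the host component only.
    $host = parse_url($url, PHP_URL_HOST);
    return is_string($host)
        && in_array(strtolower($host), ['facebook.com', 'www.facebook.com'], true);
}
```

Since the filter rejects input without a scheme, a caller that wants to accept `facebook.com/username` would need to prepend `http://` before calling this.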
dave
  • I'm not sure that's the best route to take. As I said above, I'm out of my depth here, but I don't think comparing against a specific string will work, as the part after the / that identifies the user id, username or page can take many formats – lil_bugga Jul 12 '14 at 19:33
  • Well, you know the host will remain the same. – dave Jul 12 '14 at 19:50
  • I've looked at your code with a fresh head today and decided that I could work with it. I've adapted it and got it semi-working, but I still face issues due to how Facebook's links now seem to work. I've edited my original question to reflect the new issues. – lil_bugga Jul 13 '14 at 07:01
  • With a bit of graft and a couple more allowed characters in my preg_replace statement, I seem to have got this working, so I'll accept this answer as it gave me the best basis to work from. – lil_bugga Jul 13 '14 at 17:21

I have taken some regex pattern from HERE

Get the matched groups.

(?:http|https|ftp|ftps(?:\/\/)?)?(?:www.|[-;:&=\+\$,\w]+@)([A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??((?:[-\+=&;%@.\w_]*)#?(?:[\w]*)?))

Online demo

Input:

www.example.com/user.php?u=borris
http://www.vwrx-project.co.uk/user.php?u=borris

Output:

MATCH 1
1.  [4-15]  `example.com`
2.  [15-33] `/user.php?u=borris`
3.  [25-33] `u=borris`
MATCH 2
1.  [45-63] `vwrx-project.co.uk`
2.  [63-81] `/user.php?u=borris`
3.  [73-81] `u=borris`
Braj