I'm looking for some form of input security on a project I'm working on. Basically, I wish to flag text if the user has entered any form of a URL.

E.g. 'For more of my pic visit myhotpic.net'

It would detect a URL, and then I can flag the string for validation by staff. So I need to check for any form of a URL.

There is a similar question here, "Finding urls from text string via php and regex?", with an answer. But I have tried this with various strings and I do not get the expected results.

For example

$pattern = '#(www\.|https?:\/\/){?}[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';
$count = preg_match_all($pattern, 'http://www.Imaurl.com', $matches, PREG_PATTERN_ORDER);

returns matches as

array(3) {
  [0]=>
  array(0) {
  }
  [1]=>
  array(0) {
  }
  [2]=>
  array(0) {
  }
}

and no error is returned via preg_last_error().

Why is this not working? Is there an error in the regex? I would assume it to be fine, as other users have had success with it.

I cannot seem to find a suitable answer for my problem anywhere else.

Shane
  • turn the `{?}` into just `?` and it seems to work fine. no idea what `{?}` is supposed to do; i've never seen anything like that. – user428517 Jul 31 '13 at 20:55
  • I'd just look for "www", "http", anything.[a-z]{2,3} etc. If you only want to detect the URL and don't need to actually extract it, this should work fine. – MightyPork Jul 31 '13 at 20:56
  • @mightypork what if one of the strings contains text talking about the HTTP protocol? or `www` used to represent laughter? i'm not sure going that simple is the best approach here. there's nothing wrong with using the full regex; this one isn't very complicated. – user428517 Jul 31 '13 at 20:59
  • I think you need to better define what constitutes a URL. In your case it seems to be either of the form www.somedomain.tld or http(s)://somedomain.tld. What about http(s)://www.somedomain.tld? What about http(s)://subdomain.domain.tld? What about subdomain.domain.tld? What about IP addresses for host names instead of domains? What about other protocols besides http? What about non-standard ports? If you are looking to just flag linkable URLs, perhaps you just need to search for `://`; otherwise there are a whole lot of other combinations you haven't considered. – Mike Brant Jul 31 '13 at 21:05
  • **Read this:** http://stackoverflow.com/questions/17900004/turn-plain-text-urls-into-active-links-using-php/17900021#17900021 – Maciej A. Czyzewski Jul 31 '13 at 21:08
  • I doubt this can really be done without actually flagging ALL the postings. By looking at your example, a URL is "letters" + "a dot" + "letters". Anything that conforms to this pattern will be flagged as a URL, especially stuff like "I paid 29.99 dollars" - 29.99 being the domain name here. Or anything where the user didn't bother pressing space before or after the dot. On the other hand, it is completely easy to inject URLs that are not detected (by adding spaces), or just let the user do the work: "Search for 'nasty keywords' on google". – Sven Jul 31 '13 at 21:14
  • I need something. I'm not familiar with regex and am working to a deadline. I suppose it would be nice to check for valid extensions, but it's more about time. On top of that I'm checking for `thisismysiteDOTcom`. So I expect many to be flagged, and if it becomes an issue we will reassess. – Shane Jul 31 '13 at 22:08
  • Be careful, this regex also catches email addresses as their domain names match domain capture criteria. – Martin Feb 23 '20 at 13:59
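
The false positives raised in the comments can be seen with a small sketch. It assumes the question's pattern with `{?}` replaced by `?`, as the first comment suggests; the helper name `firstUrlLikeMatch` and the sample strings are made up for illustration (the email address is borrowed from the second answer).

```php
<?php
// Question's pattern with "{?}" replaced by "?" (per the first comment).
$pattern = '#(www\.|https?:\/\/)?[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';

// Hypothetical helper: returns the first URL-like match, or null if none.
function firstUrlLikeMatch(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Illustrative inputs based on the cases mentioned in the comments.
$samples = [
    'I paid 29.99 dollars',                // price looks like "letters.letters"
    'Mail me at horsae@microsoft.com',     // domain part of an email address
    'Search for nasty keywords on google', // no dot at all
    'visit thisismysiteDOTcom',            // obfuscated, no dot
];

foreach ($samples as $sample) {
    $hit = firstUrlLikeMatch($pattern, $sample);
    echo $sample, ' => ', $hit ?? '(no match)', "\n";
}
```

With the prefix optional, the price and the email domain are flagged, while the obfuscated `DOTcom` form and the "search on google" phrasing pass through, so a separate check would still be needed for those.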

2 Answers

In the regex, change {?} to just ?. Then it will work. No idea what {?} is supposed to mean (I've never seen anything like that).

Your regex will work fine for some URLs, but you should be aware that URLs can be much more complicated than you might assume, and a regex that can match every URL is VERY complex. You might want to look up a better regex—you only need one complicated enough to handle the sorts of URLs you're expecting to match.
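
As a rough illustration of the fix: `{?}` is not a valid quantifier, so PCRE most likely reads it as an optional literal `{` followed by a required literal `}`, which would explain the empty result. The sketch below compares both patterns on the string from the question; the variable names and the `www.}example.com` test string are only for illustration.

```php
<?php
// Subject string taken from the question.
$subject = 'http://www.Imaurl.com';

// Original pattern: "{?}" is read as an optional "{" plus a mandatory "}".
$broken = '#(www\.|https?:\/\/){?}[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';

// Fixed pattern: a plain "?" makes the "www." / "http(s)://" prefix optional.
$fixed = '#(www\.|https?:\/\/)?[a-zA-Z0-9]{2,254}\.[a-zA-Z0-9]{2,4}(\S*)#i';

$brokenCount = preg_match_all($broken, $subject, $brokenMatches, PREG_PATTERN_ORDER);
$fixedCount  = preg_match_all($fixed, $subject, $fixedMatches, PREG_PATTERN_ORDER);

// The broken pattern only matches when a literal "}" follows the prefix.
$braceCount = preg_match($broken, 'www.}example.com');

echo "broken pattern matches: $brokenCount\n";
echo "fixed pattern matches:  $fixedCount\n";
echo "broken pattern on 'www.}example.com': $braceCount\n";
```

Here the original pattern finds nothing in `http://www.Imaurl.com`, while the fixed one matches the whole string.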

user428517

Just to add a little work on this specific question;

I took the original Regex as given by the OP and carried out some tweaks to it: This is NOT perfect but does improve upon the original.

  • Added a negative lookbehind to avoid matches preceded by @ (such as email addresses)
  • removed the incorrect {?}
  • Made the http or www a requirement rather than optional.
  • added _ and - characters to accepted URL character set ( I know this concept overall can be greatly expanded upon ).

so;

#(?<!@)(www\.|https?:\/\/)[a-z0-9-_]{2,254}\.[a-z0-9]{2,4}(\S*)#i

Example:

check out my facebook www.prop-ERty-bg.ru/11be check out my facebook www.property-bg.ru/11be horsae@microsoft.com

This catches both www.prop-ERty-bg.ru/11be and www.property-bg.ru/11be but avoids the email address.
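
A minimal sketch of how this pattern might be used to flag text for review, run against the example string above. Note that PHP has no `g` modifier: `preg_match_all()` already searches the whole string, so only `i` is used here. The hyphen is placed last in the character class so it is unambiguously literal, and `$needsReview` is a hypothetical flag name.

```php
<?php
// Tweaked pattern from this answer; hyphen moved to the end of the class.
$pattern = '#(?<!@)(www\.|https?:\/\/)[a-z0-9_-]{2,254}\.[a-z0-9]{2,4}(\S*)#i';

// Example text from this answer.
$text = 'check out my facebook www.prop-ERty-bg.ru/11be '
      . 'check out my facebook www.property-bg.ru/11be horsae@microsoft.com';

// preg_match_all() returns the number of full-pattern matches.
$count = preg_match_all($pattern, $text, $matches);

// Hypothetical flag: send the text to staff if anything URL-like was found.
$needsReview = $count > 0;

print_r($matches[0]);
echo $needsReview ? "flag for staff review\n" : "looks clean\n";
```

The email address is skipped mainly because the `www.` or `http(s)://` prefix is now required; the lookbehind only guards against a prefix that directly follows an `@`.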

Martin