Is it possible to search for and remove URLs from a string in PHP? I'm talking about the actual text here, not the HTML. Examples to remove:

mywebsite.com
http://mywebsite.org
www.mywebsite.co.uk
www.my-web-site.net
sub.mywebsite.edu
etc

My issue is users submitting a description field and using it to promote their own URLs. I'm not sure if it's possible without generating too many false positives. I've thought about detecting the http:// or www. prefix, but that doesn't stop links like mywebsite.com

Alex
  • See http://stackoverflow.com/questions/910912/extract-urls-from-text-in-php. This link may not solve your problem, but there's some information in the answers you may find useful. – Herbert Oct 14 '11 at 15:05
  • The most effective way to find URLs (whether encoded as www dot place dot com or any other way) is to use the human eyes and brain - involve the community, if at all possible. – Code Jockey Oct 14 '11 at 16:07

3 Answers

1

This regex seems to do the trick:

!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\'\\\\\+&%\$#\=~_\-]+))*\b!i

It is a slight modification of this regex from Regular Expression Library.

I realize it’s a bit overwhelming, but that's to be expected when searching for URLs. Nevertheless, it matches everything on your list.
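
As a rough sketch of how the pattern could be applied, here it is passed to preg_replace(). The pattern is copied verbatim from above into a nowdoc, so none of its backslashes or quotes need extra escaping. The helper name stripUrls() and the "[link removed]" placeholder are my own assumptions, not anything from the regex library.

```php
<?php
// The pattern from the answer, kept verbatim. A nowdoc performs no escape processing.
$urlPattern = <<<'REGEX'
!\b(((ht|f)tp(s?))\://)?(www.|[a-z].)[a-z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-z0-9\.\,\;\?\\'\\\\\+&%\$#\=~_\-]+))*\b!i
REGEX;

// Hypothetical helper: replaces every match of $pattern with a placeholder.
function stripUrls(string $text, string $pattern, string $placeholder = '[link removed]'): string
{
    $result = preg_replace($pattern, $placeholder, $text);

    // preg_replace() returns null on failure; fall back to the untouched text.
    return $result ?? $text;
}

echo stripUrls('Save money and come direct to www.example.com today', $urlPattern), PHP_EOL;
```

Passing an empty string as the placeholder removes the match outright instead of marking it.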

Alternatively, you could loop through each word in the description and use parse_url() to see how the word breaks down. I'll leave the criteria for deciding whether it's a URL to you. There's still the potential for false positives, but they could be greatly reduced. Combined with Andrew's idea of flagging questionable content for moderation, it could be a workable solution.
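
Here is a minimal sketch of the word-by-word idea. One thing to watch: parse_url() only reports a host when a scheme is present, so the sketch prepends "http://" to bare words before parsing. The function names and the criterion itself (a host with at least one dot and a final label of two or more letters) are assumptions for illustration; you would tune them to your data.

```php
<?php
// Hypothetical helper: decides whether a single word looks like a URL.
function looksLikeUrl(string $word): bool
{
    // Strip punctuation that commonly surrounds a word in running text.
    $word = trim($word, " \t\n\r.,;:!?()[]\"'");
    if ($word === '') {
        return false;
    }

    // parse_url() only fills in the host when a scheme is present,
    // so add one for bare domains such as mywebsite.com.
    if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $word)) {
        $word = 'http://' . $word;
    }

    $host = parse_url($word, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }

    // Assumed criterion: at least one dot, and a final label of two or more letters.
    return (bool) preg_match('~^[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$~i', $host);
}

// Hypothetical helper: returns the words of a description that look like URLs.
function findUrlLikeWords(string $description): array
{
    $words = preg_split('/\s+/', $description, -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_filter($words, 'looksLikeUrl'));
}

print_r(findUrlLikeWords('Visit sub.mywebsite.edu or http://mywebsite.org, thanks'));
```

A missing space after a full stop ("end.Next") would still be flagged by this criterion, which is why pairing it with moderation rather than silent deletion seems safer.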

Herbert
  • @Code Jockey: add it to the piped list `(com|edu|gov|...|ca|uk|travel)` – Herbert Oct 14 '11 at 15:57
  • 1
    This also doesn't filter out a lot of the URL shorteners out there (bit.ly, goo.gl, etc...) – Code Jockey Oct 14 '11 at 15:58
  • I have yet to find the _perfect_ regex for matching urls. I'd be interested in seeing it if anyone has. – Herbert Oct 14 '11 at 16:03
  • such an expression would test the limits of a Cray supercomputer, but I'm sure it's _technically_ possible - I'm just picking nits! – Code Jockey Oct 14 '11 at 16:05
  • I can live without the URL shorteners. I'm just trying to stop the blatant piss taking. For example we have had stuff like; "Don't buy here, save money and come direct to our store www.douchebags.com" – Alex Oct 14 '11 at 16:09
  • @Alex: I just came across this [PHP Bayesian Spam Filter](http://archive.atomicmpc.com.au/forums.asp?s=2&c=10&t=4466). I don't know anything about it... _yet_, but it may be worth looking into. See also [Class: Bayesian Spam Filter](http://www.phpclasses.org/package/4236-PHP-Detect-spam-in-text-using-Bayesian-techniques.html) – Herbert Oct 14 '11 at 16:45
  • It may be sufficient to test for dots that appear at the start or in the middle of words. (i.e. not a period.) `\.\b` Since you want to search through a _description_, I can't think of any reason a period would be in any other position than the end of a sentence. – Herbert Oct 15 '11 at 06:01
0

You could try something that looks for .TLD, where TLD is any existing top-level domain, but that may result in too many false positives.
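
A rough sketch of the .TLD check, assuming a deliberately short TLD list (a real one would be far longer) and a hypothetical function name:

```php
<?php
// Assumed, deliberately short list; extend it with whichever TLDs matter to you.
const TLDS = ['com', 'net', 'org', 'edu', 'uk'];

// Hypothetical helper: true when the text contains "<word char>.<known TLD>".
function containsTld(string $text, array $tlds = TLDS): bool
{
    $alternatives = implode('|', array_map(function ($tld) {
        return preg_quote($tld, '/');
    }, $tlds));

    // A dot directly after a word character, then a listed TLD, then a word boundary.
    return preg_match('/\w\.(?:' . $alternatives . ')\b/i', $text) === 1;
}

var_dump(containsTld('visit mywebsite.com now'));
var_dump(containsTld('Nice item. Comes in blue.'));
```

Requiring the dot to sit directly between a word character and the TLD is what keeps ordinary sentence-ending periods from matching.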

Would it be possible to implement a system where posts containing questionable content need moderation to be posted, but others are posted right away? I'm assuming it's a firm business requirement to disallow this type of content.

Personally, I would tend to just prevent any hyperlinking, and leave it at that. But, it's not my app.

Andrew Barber
  • I'd do this - but expand on it a little bit so after I've found a matching TLD I'd go backwards in the string a little bit and inspect the string up until I get a non-url character (like space, newline, etc.). Though this doesn't stop people doing the things where they do "example [dot] c0m" –  Oct 14 '11 at 14:33
  • Hyperlinking is already prevented, but users have just moved to making text links instead. I recognise that I'm never going to be able to stop the most determined linker (the example [dot] c0m) but would like to stop the casual example.com – Alex Oct 14 '11 at 14:35
  • 2
    Another option (depending upon your primary user base and their level of activity and cooperation) is a flag/vote down button, which can either get a moderator's attention, or hide/delete the comment after so many votes (or both, though this might take more effort to implement, obviously) – Code Jockey Oct 14 '11 at 16:01
0

You can use a regex to find the URLs, then use PHP's preg_replace function to specify what replaces them.

http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Edit: Since this is user-submitted data, you might want to validate the "description" field before you store it and check whether it contains a URL. If it does, you can prevent the user from saving the form.

For this, you can use preg_match, while still using a regex to find a URL.
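
A minimal sketch of that validation step. The pattern here is a simplified placeholder of my own, not the regex from the linked article; swap in whichever URL regex you settle on. The function name and the error message are likewise assumptions.

```php
<?php
// Placeholder pattern (an assumption): matches http(s):// or www. prefixes,
// or a bare domain ending in one of a few listed TLDs.
const URL_PATTERN = '~\b(?:https?://|www\.)\S+|\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:com|net|org|edu|co\.uk)\b~i';

// Hypothetical helper: returns a list of validation errors (empty when the field is fine).
function validateDescription(string $description): array
{
    $errors = [];

    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (preg_match(URL_PATTERN, $description) === 1) {
        $errors[] = 'Please remove web addresses from the description.';
    }

    return $errors;
}

print_r(validateDescription('Cheaper at mywebsite.com'));
```

Rejecting the form rather than silently stripping the text lets the user see why the description was refused and correct a false positive themselves.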

nickb