21

I have a load of user-submitted content. It is HTML, and may contain URLs. Some of them will be <a>'s already (if the user is good) but sometimes users are lazy and just type www.something.com or at best http://www.something.com.

I can't find a decent regex to capture URLs but ignore ones that are immediately to the right of either a double quote or '>'. Anyone got one?

Blair Conrad
Nick Locking

6 Answers

16

Jan Goyvaerts, creator of RegexBuddy, has written a response to Jeff Atwood's blog that addresses the issues Jeff had and provides a nice solution.

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

In order to ignore matches that occur right next to a " or >, you could add (?<![">]) to the start of the regex, so you get

(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]

This will match full addresses (http://...) and addresses that start with www. or ftp. - you're out of luck with addresses like ars.userfriendly.org...
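As a rough illustration, here is how the second pattern might be applied in Python. The answer itself is language-neutral, so the function name `find_bare_urls` and the sample strings are made up for this sketch. The pattern is compiled case-insensitively because its character classes only list A-Z.

```python
import re

# Pattern from the answer, with the (?<![">]) lookbehind added.
# IGNORECASE is needed because the classes only list upper-case letters.
URL_RE = re.compile(
    r'(?<![">])\b(?:(?:https?|ftp|file)://|www\.|ftp\.)'
    r'[-A-Z0-9+&@#/%=~_|$?!:,.]*[A-Z0-9+&@#/%=~_|$]',
    re.IGNORECASE,
)

def find_bare_urls(text):
    """Return URLs that are not directly preceded by a double quote or '>'."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return URL_RE.findall(text)

print(find_bare_urls('Visit http://example.com/page.'))
```

One caveat worth checking against your own data: the lookbehind only inspects the single character before the match, so a `www.` that sits inside an already-quoted `http://www...` attribute value can still be picked up.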

Vic
Tim Pietzcker
12

This thread is old as the hills, but I came across it while working on my own problem: converting any URLs into links while leaving alone any that are already inside anchor tags. After a while, this is what popped out:

(?!(?!.*?<a)[^<]*<\/a>)(?:(?:https?|ftp|file)://|www\.|ftp\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]

With the following input:

http://www.google.com
http://google.com
www.google.com

<p>http://www.google.com<p>

this is a normal sentence. let's hope it's ok.

<a href="http://www.google.com">www.google.com</a>

This is the output of a preg_replace:

<a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
<a href="http://google.com" rel="nofollow">http://google.com</a>
<a href="www.google.com" rel="nofollow">www.google.com</a>

<p><a href="http://www.google.com" rel="nofollow">http://www.google.com</a><p>

this is a normal sentence. let's hope it's ok.

<a href="http://www.google.com">www.google.com</a>

Just wanted to contribute back to save somebody some time.
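For readers not using PHP, the replacement above can be approximated in Python. This is only a sketch: the function name `linkify` is hypothetical, the case-insensitive flag is an assumption (the answer does not show its flags), and `\g<0>` stands in for PHP's `\0`.

```python
import re

# Pattern from this answer. The leading negative lookahead rejects a position
# when the text up to the next '<' runs straight into a closing </a>.
URL_RE = re.compile(
    r'(?!(?!.*?<a)[^<]*</a>)'
    r'(?:(?:https?|ftp|file)://|www\.|ftp\.)'
    r'[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]',
    re.IGNORECASE,  # assumed; the classes only list upper-case letters
)

def linkify(html):
    """Wrap bare URLs in anchor tags, leaving existing anchors untouched."""
    return URL_RE.sub(r'<a href="\g<0>" rel="nofollow">\g<0></a>', html)

print(linkify('<p>http://www.google.com<p>'))
```

As in the output shown above, a match that starts with `www.` ends up in the `href` without a scheme, so you may want to prepend `http://` in that case.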

Matt
11

I made a slight modification to the Regex contained in the original answer:

(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]

which allows for more subdomains, and also runs a more thorough check on tags. To apply this with PHP's preg_replace, you can use:

$convertedText = preg_replace( '@(?<![.*">])\b(?:(?:https?|ftp|file)://|[a-z]\.)[-A-Z0-9+&#/%=~_|$?!:,.]*[A-Z0-9+&#/%=~_|$]@i', '<a href="\0" target="_blank">\0</a>', $originalText );

Note, I removed @ from the regex, in order to use it as a delimiter for preg_replace. It's pretty rare that @ would be used in a URL anyway.

Obviously, you can modify the replacement text, and remove target="_blank", or add rel="nofollow" etc.

Hope that helps.

Hodge
  • I've added an = to the (?<![.*">]) at the start to not break link (non-quoted anchor tags). Nice regex btw :) – Joel Jun 29 '10 at 10:41
  • @Joel: Are you sure you want that lookbehind to mean "Assert that it's impossible to match a dot, an asterisk, a quote or a closing angle bracket before the current position in the string"? – Tim Pietzcker Apr 13 '12 at 16:44
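To illustrate the point raised in that comment: inside square brackets, `.` and `*` are literal characters, so the lookbehind also rejects a URL that is preceded by a dot or an asterisk. The short Python check below uses a deliberately simplified URL part (`https?://\S+` is an assumption for the demo, not the answer's pattern), and the helper name `matches` is made up.

```python
import re

# Same lookbehind as the answer; the URL part is simplified for the demo.
PATTERN = re.compile(r'(?<![.*">])\bhttps?://\S+', re.IGNORECASE)

def matches(text):
    """Return every URL the simplified pattern finds in text."""
    return PATTERN.findall(text)

print(matches('see http://example.com'))    # preceded by a space
print(matches('see *http://example.com'))   # preceded by a literal asterisk
print(matches('see .http://example.com'))   # preceded by a literal dot
```

If the intent was only to skip URLs after a quote or a closing bracket, a class of just `[">]` (plus `=` for unquoted attributes, if needed) would probably express that more directly.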
1
if (preg_match('/\b(?<!=")(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[A-Z0-9+&@#\/%=~_|](?!.*".*>)(?!.*<\/a>)/i', $subject)) {
    # Successful match
} else {
    # Match attempt failed
}
RUX
1

To skip existing ones just use a look-behind - add (?<!href=") to the beginning of your regular expression, so it would look something like this:

/(?<!href=")http:\/\/\S*/

Obviously this isn't a complete solution for finding all types of URLs, but this should solve your problem of messing with existing ones.
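A small Python sketch of the same look-behind idea follows. Python patterns have no `/` delimiters, so the slashes need no escaping here; the function name `bare_http_urls` and the sample strings are invented for illustration.

```python
import re

# Skip any http:// URL that directly follows href=" (fixed-width lookbehind).
PATTERN = re.compile(r'(?<!href=")http://\S*')

def bare_http_urls(text):
    """Return http:// URLs that are not the value of a quoted href."""
    return PATTERN.findall(text)

print(bare_http_urls('Visit http://example.com today'))
print(bare_http_urls('<a href="http://example.org">link</a>'))
```

Because `\S*` runs until the next whitespace, a URL followed immediately by a tag (for example `http://example.com</p>`) will swallow the tag as well, which is one of the gaps a fuller pattern would need to close.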

Nicole
  • Thank you, exactly what I was looking for, ended up with `/((?<!href=")https?:\/\/[^\s\<]+)/g` so it will break on conditions where next tag starts just after link ends – harsh zalavadiya May 07 '20 at 13:26
0

Shameless plug: You can look here (regular expression replace a word by a link) for inspiration.

The question asked to replace some word with a certain link, unless there already was a link. So the problem you have is more or less the same thing.

All you need is a regex that matches a URL (in place of the word). The simplest assumption would be this: a URL (optionally) starts with "http://", "ftp://" or "mailto:" and lasts as long as there are no white-space characters, line breaks, tag brackets or quotes.

Beware, long regex ahead. Apply case-insensitively.

(href\s*=\s*['"]?)?((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)

Be warned: this will also match URLs that are technically invalid, and it will recognize things.formatted.like.this as a URL. Whether that is too permissive depends on your data. I can fine-tune the regex if you have examples where it returns false positives.

The regex produces two match groups. Group 2 contains the matched thing, which is most likely a URL. Group 1 contains either an empty string or an 'href="'. You can use it as an indicator that the match occurred inside the href attribute of an existing link, so you don't have to touch that one.

Once you confirm that this does the right thing for you most of the time (with user supplied data, you can never be sure), you can do the rest in two steps, as I proposed it in the other question:

  1. Make a link around every URL there is (unless there is something in match group 1!) This will produce double nested <a> tags for things that have a link already.
  2. Scan for incorrectly nested <a> tags, removing the innermost one
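Step 1 could be sketched in Python roughly as follows. The function name `link_step_one` is hypothetical, and step 2 (cleaning up nested <a> tags) is not covered here. Note that Python reports a non-participating group 1 as `None` rather than an empty string, so the check below is a simple truthiness test.

```python
import re

# The two-group pattern from above, split across lines only for readability.
PATTERN = re.compile(
    r'''(href\s*=\s*['"]?)?'''
    r'''((?:http://|ftp://|mailto:)?[^.,<>"'\s\r\n\t]+'''
    r'''(?:\.(?![.<>"'\s\r\n])[^.,!<>"'\s\r\n\t]+)+)''',
    re.IGNORECASE,
)

def link_step_one(html):
    """Wrap every URL-like match in <a>, unless it sits inside an href."""
    def wrap(match):
        if match.group(1):
            # Group 1 matched 'href=', so this URL belongs to an existing link.
            return match.group(0)
        url = match.group(2)
        return '<a href="%s">%s</a>' % (url, url)
    return PATTERN.sub(wrap, html)

print(link_step_one('go to www.example.com now'))
```

URL-like text inside the body of an existing link still gets wrapped by this step, which is why the second, clean-up pass is needed.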
Community
Tomalak