Regex URL Replace, ignore Images and existing Links

Question

I have a regex that works well for replacing URLs in a string with clickable links.

string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";

Now, how can I tell it to ignore already clickable links and images?

It should ignore strings like these:

<a href="http://www.someaddress.com">Some Text</a>

<img src="http://www.someaddress.com/someimage.jpg" />

Example:

The website www.google.com, once again <a href="http://www.google.com">www.google.com</a>, the logo <img src="http://www.google.com/images/logo.gif" />

Result:

The website <a href="http://www.google.com">www.google.com</a>, once again <a href="http://www.google.com">www.google.com</a>, the logo <img src="http://www.google.com/images/logo.gif" />

Full HTML Parser code:

string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
Regex r = new Regex(regex, RegexOptions.IgnoreCase);

text = r.Replace(text, "<a href=\"$1\" title=\"Click to open in a new window or tab\" target=\"_blank\" rel=\"nofollow\">$1</a>").Replace("href=\"www", "href=\"http://www");

return text;

Good, but hard to read and hard to maintain; easy with an HTML parser. — Adrian Iftode, Feb 21 '12 at 08:24
I answered this already [here](http://stackoverflow.com/a/8833696/626273) — stema, Feb 21 '12 at 08:48
possible duplicate of [Regex string issue in making plain text urls clickable](http://stackoverflow.com/questions/8833588/regex-string-issue-in-making-plain-text-urls-clickable) — stema, Feb 21 '12 at 08:49
Yes, I am trying to parse HTML, I just updated the Question and pasted all the code. — Cindro, Feb 21 '12 at 10:33
... don't parse HTML with regex: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Sam Greenhalgh, Feb 21 '12 at 12:55

score 2 · Accepted Answer · edited May 23 '17 at 10:34

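First, I'll post the obligatory link if no one else will: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)

How about using a negative lookbehind and a negative lookahead for a double quote (`"`), so that a URL directly enclosed in quotes is skipped? Like this: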

string regex = @"(?<!"")((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])(?!"")";
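To try the lookbehind idea on a small input, here is a minimal sketch. It uses the tightened pattern that the asker reports in the comments on this answer, which rejects a URL that directly follows `="`. The class name `LinkifyDemo`, the method `Linkify`, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkifyDemo
{
    // Negative lookbehind: reject a URL that directly follows =" (an attribute value).
    // This variant only matches scheme-prefixed URLs; bare www. addresses are not matched.
    private static readonly Regex UrlPattern = new Regex(
        @"(?<!\w?="")(((http|https|ftp|news|file)+://)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: wraps every remaining match in an anchor tag.
    // $1 is the outermost group, which spans the whole URL.
    public static string Linkify(string text)
    {
        return UrlPattern.Replace(text, "<a href=\"$1\">$1</a>");
    }

    public static void Main()
    {
        string input =
            "See http://www.google.com, or <a href=\"http://www.google.com\">Google</a> " +
            "<img src=\"http://www.google.com/images/logo.gif\" />";
        Console.WriteLine(Linkify(input));
    }
}
```

Only the plain-text URL at the start is wrapped; the `href` and `src` values stay as they are because each is preceded by `="`. The trade-off is that a bare `www.google.com` is no longer linkified, since this variant requires a scheme.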

answered Feb 21 '12 at 12:54 by Sam Greenhalgh · edited May 23 '17 at 10:34 by Community

We should really stop providing workarounds after posting the obligatory reference... – jessehouwing Feb 21 '12 at 17:05
Agreed, but I wouldn't want to post an unhelpful comment as an answer either. – Sam Greenhalgh Feb 21 '12 at 23:05
This worked for me: `(?<!\w?="")(((http|https|ftp|news|file)+://)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])` – Cindro Feb 22 '12 at 22:39

score 1 · Answer 2 · edited May 23 '17 at 10:34

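Check out the answer to "Detect email in text using regex" and swap in your regex for links. Because it runs the regex only on the text nodes of a parsed document, it never replaces a URL inside a tag, only in the text content. It uses the HTML Agility Pack: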

http://htmlagilitypack.codeplex.com/

Something like:

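using System.Text.RegularExpressions;

string textToBeLinkified = "... your text here ...";
const string regex = @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])";
// ExplicitCapture is fine here because the replacement only uses $0 (the whole match).
Regex urlExpression = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(textToBeLinkified);

// Select only text nodes that are not inside an existing <a> element.
// Attribute values such as <img src="..."> are not text nodes, so they are skipped as well.
// SelectNodes returns null when nothing matches, hence the null check.
var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        node.InnerHtml = urlExpression.Replace(node.InnerHtml, @"<a href=""$0"">$0</a>");
    }
}
string linkifiedText = doc.DocumentNode.OuterHtml;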
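One detail the replacement string above does not handle: a bare `www.` match goes into `href` unchanged, which browsers treat as a relative link. The question's code patches this afterwards with `.Replace("href=\"www", "href=\"http://www")`. A possible alternative, sketched here with made-up names (`AnchorBuilder`, `ToAnchor`, `Linkify`), is to build the anchor in a `MatchEvaluator` and pass it to `Regex.Replace` in place of the replacement string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorBuilder
{
    // The URL pattern from the question, with a plain & in the character class.
    private static readonly Regex UrlExpression = new Regex(
        @"((www\.|(http|https|ftp|news|file)+\:\/\/)[_.a-z0-9-]+\.[a-z0-9\/_:@=.+?,##%&~-]*[^.|\'|\# |!|\(|?|,| |>|<|;|\)])",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Hypothetical helper: builds the anchor for one matched URL and
    // prefixes http:// when the match starts with a bare "www.".
    public static string ToAnchor(Match match)
    {
        string url = match.Value;
        string href = url.StartsWith("www.", StringComparison.OrdinalIgnoreCase)
            ? "http://" + url
            : url;
        return "<a href=\"" + href + "\">" + url + "</a>";
    }

    // Applies the evaluator to plain text (for example, the content of one text node).
    public static string Linkify(string text)
    {
        return UrlExpression.Replace(text, ToAnchor);
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("The website www.google.com is up"));
    }
}
```

In the node loop, this would mean calling `urlExpression.Replace(node.InnerHtml, AnchorBuilder.ToAnchor)`, so the visible link text keeps the original `www.` form while the `href` gets a scheme.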
