Regex : Replace URL unless starts with src=

Question

I'm looking for a regular expression to use in Regex.Replace that lets me wrap a URL in a link element. The idea is this:

http://www.tenforce.com => <a target='new' href='http://www.tenforce.com'>http://www.tenforce.com</a>

However, the regex must not do this when the URL is part of an HTML element, such as an image tag. So if we have, for example:

<img src="http://www.tenforce.com/logo.jpg" />

It should not be converted using the regex.

The original regex that we used was this one:

@"(http|ftp|https):((\/\/)|(\\\\))[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";

But this wraps every URL it can find in an a-tag. I don't want it to touch URLs that are preceded by src=\"

So I tried adding [^(src=.)] to it, but this results in normal URLs no longer being transformed. It doesn't transform the image tags either, though.

The code looks like this:

        /// <summary>
        /// Extends the text with hyperlinks.
        /// </summary>
        /// <param name="value">The value.</param>
        /// <param name="workspaceId">The workspace id where the user is working in. Used when parsing the wiki links</param>
        /// <returns></returns>
        public static string ExtendWithHyperlinks(string value, int? workspaceId)
        {
            if (value == null) return null;

            const string UrlPattern = @"[^(src=.)](http|ftp|https):((\/\/)|(\\\\))[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";
            const string FilePattern = @"(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\\w((\.*\w+)|( *\w+))*)+";

            value = Regex.Replace(value, UrlPattern, "<a target='new' href='$0'>$0</a>").Replace(":\\\\", "://");
            value = Regex.Replace(value, FilePattern, "<a target='new' href='file:///$0'>$0</a>");
            value = TemplateParser.Parse(value, workspaceId, Path.GetDirectoryName(Path.GetDirectoryName(Assembly.GetExecutingAssembly().GetName().CodeBase.Remove(0, 8))));
            return value;
        }

score 1 · Accepted Answer · answered Oct 18 '11 at 08:02

You could probably do it with a negative lookbehind

(?<!src=['"]?)(http|ftp|https):...
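As a minimal sketch of how this could be tried out (the Linkifier class and Linkify method are hypothetical names, and the URL pattern is copied from the question with only the lookbehind added in front): the attempted [^(src=.)] is a character class, so it consumes one character that is not (, s, r, c, =, . or ) in front of the URL. A lookbehind only inspects the preceding text without consuming it. .NET accepts the optional quote inside a lookbehind; some other regex engines only allow fixed-width lookbehinds.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Negative lookbehind: reject a URL that is directly preceded by src=
    // plus an optional single or double quote. The rest of the pattern is
    // the one from the question, unchanged.
    private const string UrlPattern =
        @"(?<!src=['""]?)(http|ftp|https):((\/\/)|(\\\\))[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])?";

    // Hypothetical helper: wraps every URL outside a src= attribute in an a-tag.
    public static string Linkify(string value)
    {
        if (value == null) return null;

        // $0 is the whole match; the lookbehind consumes nothing, so no
        // stray leading character ends up inside the generated link.
        return Regex.Replace(value, UrlPattern, "<a target='new' href='$0'>$0</a>");
    }
}
```

With this pattern a bare URL is still wrapped, while the URL inside <img src="..." /> is left alone. Using a single . in place of ['"]? covers quoted attributes, but an unquoted src=http://... would then be linked again.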

Tetaxa

thanks, this solved the problem. Made a small adjustment though, replaced the **['"]** with just a . – codingbunny Oct 18 '11 at 08:13

score 0 · Answer 2 · edited May 23 '17 at 11:46

Effectively, this question is a dupe of many others on SO. The real answer is: don't use Regex to deal with HTML/XML. Use a dedicated HTML parser. HtmlAgilityPack is great, and you won't have to slum it with a tool that is ill-suited to the job.

Community

answered Oct 18 '11 at 08:10

spender

Valid, but not applicable in this case. The text that is entered can be anything typed by a user, and we need to convert it to valid HTML so it can be displayed properly. The problem, however, is that when a tool like ckeditor automatically injects HTML into an RTF field, we need to be able to ignore that HTML and still encode everything else to HTML – codingbunny Oct 18 '11 at 08:15
Fair enough, but downvote for sharing a common wisdom in what appeared to be an appropriate situation seems a little harsh. – spender Oct 18 '11 at 08:21
