I need to replace all URLs in a text with links, but only those that are not already wrapped in HTML anchor tags (<a href="...">...</a>).

The unconditional replacement is described in "Recognize URL in plain text".

Here is my implementation of it:

    private static readonly Regex UrlMatcherRegex = new Regex(@"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    public static string GetProcessedMessage(this INews news)
    {
        string res = UrlMatcherRegex.Replace(news.Mess, ReplaceHrefByAnchor);

        return res;
    }

    private static string ReplaceHrefByAnchor(Match match)
    {
        string href = match.Groups[0].Value;
        return string.Format("<a href=\"{0}\" target=\"_blank\">{0}</a>", href);
    }

But how can I ignore those URLs which are already formatted properly?

Please advise.

P.S. I'm using ASP.NET 4.5

P.P.S. I imagine one solution would be to enhance the regex to check for "

Budda

1 Answer

From my point of view there are 2 solutions:

  1. Use a parsing library to load the document instead of scanning it as raw text. For example, XDocument.Parse works if the markup is well-formed XML (XHTML); ordinary HTML with unclosed tags will make it throw. Once the document is parsed, you can tell whether a URL sits inside an "a" element (or its href attribute) or in a plain text node, and replace only the latter.
  2. Assume that a link which is already formatted properly has an "href=" prefix before the URL. Your regex can then match only URLs that are not preceded by "href=". You can do this check either in C# (inside the match evaluator) or with the regex negative look-around feature. See the example in "Regular expression to match string not containing a word?".
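
The second option could be sketched roughly as follows. This is only an illustration, not a tested production solution: the class name UrlLinker, the method LinkifyPlainUrls and the two look-around guards are my own assumptions, while the URL pattern is the one from the question. A lookbehind on "href=" alone is not quite enough, because the URL usually also appears as the link text, so the sketch adds a lookahead that rejects anything followed by a closing </a> tag. It assumes the anchor text contains no nested tags.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlLinker
{
    // URL pattern copied unchanged from the question.
    private const string UrlPattern =
        @"\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])";

    // Assumed guards (not part of the original post):
    //  - the lookbehind rejects a position inside an href="..." or href='...' value;
    //  - the lookahead rejects a position inside anchor text, i.e. one that is
    //    followed by </a> with no other tag in between.
    private const string NotAlreadyLinked =
        @"(?<!href\s*=\s*[""'][^""'>]*)(?![^<>]*</a>)";

    private static readonly Regex UnlinkedUrlRegex = new Regex(
        NotAlreadyLinked + UrlPattern,
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Wraps every URL that is not already part of an anchor in an <a> tag.
    public static string LinkifyPlainUrls(string text)
    {
        return UnlinkedUrlRegex.Replace(text, m =>
            string.Format("<a href=\"{0}\" target=\"_blank\">{0}</a>", m.Value));
    }
}
```

Calling UrlLinker.LinkifyPlainUrls(news.Mess) in place of the original Replace call should leave existing anchors untouched and wrap only the bare URLs. URLs inside other attributes (for example an img src) are not covered by these guards; if the input can contain arbitrary markup, the parser-based option is likely the safer choice.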
WhiteAngel