I want to detect email addresses in text so that I can wrap each one in an anchor tag with a mailto: link. I have a regex for this, but it also matches addresses that are already wrapped in an anchor tag or that appear inside an anchor's mailto: href.

My regex is:

    ([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)

But it detects 3 matches in the following sample text:

    ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com

I want only email@email.com to be matched by the regex.

jessehouwing
Computer User

2 Answers

This is very similar to my previous answer to your other question. Try this:

    (?<!(?:href=['"]mailto:|<a[^>]*>))(\b[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)

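The only real difference is the word boundary \b before the start of the email address. Without it, the lookbehind would only reject a match starting at the first character of an excluded address, and the regex would simply match the rest of it starting from the second character.

See a similar expression here on Regexr. It's not exactly the same, because Regexr does not support alternation or unbounded-length patterns inside a lookbehind.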
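As a minimal sketch of how the pattern above might be applied in C#, assuming .NET's `Regex` class (which accepts the variable-length lookbehind). The names `EmailLinker` and `Linkify` are made up for illustration and are not part of any library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailLinker
{
    // Same pattern as above, written as a verbatim string ("" is an escaped double quote).
    private static readonly Regex EmailRegex = new Regex(
        @"(?<!(?:href=['""]mailto:|<a[^>]*>))(\b[\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: wraps every address the pattern matches in a mailto: anchor.
    public static string Linkify(string text)
    {
        // $0 refers to the whole match.
        return EmailRegex.Replace(text, "<a href=\"mailto:$0\">$0</a>");
    }

    public static void Main()
    {
        string input = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";
        Console.WriteLine(Linkify(input));
    }
}
```

For the sample text, only the trailing email@email.com should end up wrapped. Note that the lookbehind only protects an address that directly follows `mailto:` or the opening `<a ...>` tag. Link text such as `<a href='x'>Contact foo@bar.com</a>` would still be matched, because other text sits between the tag and the address.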

Community
stema
  • One more question: your regex does not work when the anchor tag uses double quotes, e.g. href="somelink". It works for single quotes, e.g. href='somelink'. Can you help edit the lookbehind so that it covers both single quotes (') and double quotes (")? – Computer User Jan 19 '12 at 14:27

It's a better idea to leave the parsing of the HTML to something suitable for that (such as the HtmlAgilityPack) and combine that with regex to update the text nodes:

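    // Requires the HtmlAgilityPack package and: using System.Text.RegularExpressions;
    string sContent = "ttt <a href='mailto:someone@example.com'>someemail@mail.com</a> abc email@email.com";
    string sRegex = @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)";
    Regex Regx = new Regex(sRegex, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sContent);

    // Select only the text nodes that are not inside an <a> element.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var nodes = doc.DocumentNode.SelectNodes("//text()[not(ancestor::a)]");
    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            // $0 is the whole match, so ExplicitCapture does not affect the replacement.
            node.InnerHtml = Regx.Replace(node.InnerHtml, @"<a href=""mailto:$0"">$0</a>");
        }
    }
    string fixedContent = doc.DocumentNode.OuterHtml;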

I notice you've posted the same question on other forums as well, but haven't accepted an answer on any of them.

jessehouwing