0

How would I replace all href tags in a string such as:

<a href="http://thedomain.com/about">Link Title</a> and <a href="http://anotherlink.com">Another Link</a>

...with the URL placed in brackets after the content of the tag:

Link Title [http://thedomain.com/about] and Another Link [http://anotherlink.com]

Allow for capital A HREF and capital /A.

This will be used to re-format hyperlinks when sending plain-text email.

RegEx may be used. Similar to: Replace Hyperlink with Plain-Text URL Using REGEX

Loren
    You've answered the question by yourself: use the regex. So what is your question then? Are you asking for a particular regex to do the job? Please rephrase your question. – SiliconMind Feb 29 '12 at 17:21

3 Answers

4

This C# regex and replacement pattern worked for me in my testing using Expresso. The regex options specify case insensitivity, as you requested, and also tell the engine to ignore whitespace in the pattern, which I like to leave in for readability.

using System;
using System.Text.RegularExpressions;

string inputText = "your text here";
// Captures the href value as "url" and the link text as "anchorText".
string rx = "<a\\s+ .*? href\\s*=\\s*(?:\"|') (?<url>.*?) (?:\"|') .*?> (?<anchorText>.*?) \\</a>";
// IgnoreCase covers <A HREF=...>; IgnorePatternWhitespace allows the spaces inside the pattern.
Regex regex = new Regex( rx, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace );
string regexReplace = "${anchorText} [${url}]";

string result = regex.Replace( inputText, regexReplace );
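
As a quick check, the pattern above can be run against the sample string from the question. This is only a minimal sketch: the helper name ToPlainText is made up for illustration, the second tag is upper-cased here to exercise the case-insensitive option, and it assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern and options as in the answer above.
var anchorRegex = new Regex(
    "<a\\s+ .*? href\\s*=\\s*(?:\"|') (?<url>.*?) (?:\"|') .*?> (?<anchorText>.*?) \\</a>",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

// Hypothetical helper: rewrites each anchor as "anchor text [url]".
string ToPlainText(string html) => anchorRegex.Replace(html, "${anchorText} [${url}]");

// Sample from the question, with the second tag upper-cased.
string sample = "<a href=\"http://thedomain.com/about\">Link Title</a> and "
              + "<A HREF=\"http://anotherlink.com\">Another Link</A>";

Console.WriteLine(ToPlainText(sample));
```

Two limits are worth keeping in mind: without RegexOptions.Singleline the dot does not match a newline, so an anchor that spans several lines is left untouched, and the pattern does not require the closing quote to be the same character as the opening one.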
Darryl
1

Complete replacement

After some torquing around with this, I'm posting this solution. Use it or don't; it's more for my current or future reference. Amazingly, just the tag-att-val portion covers almost all use cases. Still, regex is not recommended for parsing HTML. But if it is used, it should be fairly accurate, which this is.

A C# code sample can be found here - http://ideone.com/TBxXm
It was debugged in VS2008 using the source page from CNN.com, then the working copy was pasted to ideone for a permalink.

Here is a mildly commented regex

<a 
  (?=\s) 

  # Optional preliminary att-vals (should prevent overruns)
  (?:[^>"']|"[^"]*"|'[^']*')*?

  # HREF, the attribute we're looking for
  (?<=\s) href \s* =

     # Quoted attr value (only)
     # (?> \s* (['"]) (.*?) \1 )
     # ---------------------------------------
     # Or,
     # Unquoted attr value (only)
     # (?> (?!\s*['"]) \s* ([^\s>]*) (?=\s|>) )
     # ---------------------------------------
     # Or,

  # Quoted/unquoted attr value (empty-unquoted value is allowed)
  (?: (?>             \s* (['"]) (?<URL>.*?)     \1       )
    | (?> (?!\s*['"]) \s*        (?<URL>[^\s>]*) (?=\s|>) )   
  )

  # Optional remaining att-vals
  (?> (?:".*?"|'.*?'|[^>]?)+ )

  # Non-terminated tag
  (?<!/)
>
(?<TEXT>.*?)
</a \s*>

and here, as it exists in a C# source

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;


namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            string input = @"
               <a asdf = href=  >BLANK</a>
               <a href= a""'tz target=_self >ATZ</a>
               <a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target=""_self"">Last missing U.S. soldier in Iraq ID'd</a>
               <a id=""weatherLocBtn"" href=""javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);""><span>Go</span></a>
               <a href=""javascript:CNN_handleOverlay('profile_signin_overlay')"">Log in</a>
               <a no='href' here> NOT FOUND </a>
               <a this href= is_ok > OK </a>
            ";
            string regex = @"
               <a 
                 (?=\s) 
                 (?:[^>""']|""[^""]*""|'[^']*')*?
                 (?<=\s) href \s* =
                 (?: (?>              \s* (['""]) (?<URL>.*?)     \1       )
                   | (?> (?!\s*['""]) \s*         (?<URL>[^\s>]*) (?=\s|>) )   
                 )
                 (?> (?:"".*?""|'.*?'|[^>]?)+ )
                 (?<!/)
               >
               (?<TEXT>.*?)
               </a \s*>
            ";
            string output = Regex.Replace(input, regex, "${TEXT} [${URL}]",
                                RegexOptions.IgnoreCase |
                                RegexOptions.Singleline |
                                RegexOptions.IgnorePatternWhitespace);

            Console.WriteLine(input+"\n------------\n");
            Console.WriteLine(output);
        }
    }
}

with output

           <a asdf = href=  >BLANK</a>
           <a href= a"'tz target=_self >ATZ</a>
           <a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target="_self">Last missing U.S. soldier in Iraq ID'd</a>
           <a id="weatherLocBtn" href="javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);"><span>Go</span></a>
           <a href="javascript:CNN_handleOverlay('profile_signin_overlay')">Log in</a>
           <a no='href' here> NOT FOUND </a>
           <a this href= is_ok > OK </a>

------------

           BLANK []
           ATZ [a"'tz]
           Last missing U.S. soldier in Iraq ID'd [/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1]
           <span>Go</span> [javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);]
           Log in [javascript:CNN_handleOverlay('profile_signin_overlay')]
           <a no='href' here> NOT FOUND </a>
            OK  [is_ok]

Cheers!

  • Thanks! Please post a C# version too. Much appreciated. – Loren Feb 29 '12 at 17:50
  • Dig it! And nice ideone.com test!! I chose Darryl's answer so newbies will be steered toward Expresso - a good way to find a regex for different cases. "using System.Text.RegularExpressions;" – Loren Mar 01 '12 at 01:15
  • @Loren - If it works for you, then great! It failed all over the place for me. Regex newbies and HTML won't get along that well. No problem, HTML regex is just a fun exercise - futile, but fun. –  Mar 01 '12 at 23:04
1

It is generally not a good idea to try parsing HTML with regex because of the complexity of finding a regex that handles all possible cases. Of course, if you only need to parse a small string, then it's probably acceptable.

A better option would be to use a parser instead, such as http://roberto.open-lab.com/2010/03/04/a-html-sanitizer-for-c/

Also see the answers here: RegEx match open tags except XHTML self-contained tags

Edit

OK, here is one way using the HtmlAgilityPack:

using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.Load(@"c:\test.html");

        // Select every <a> element that has an href attribute.
        var listofHyperLinkTags = from hyperlinks in htmlDoc.DocumentNode.Descendants()
                                  where hyperlinks.Name == "a" &&
                                        hyperlinks.Attributes["href"] != null
                                  select new
                                  {
                                      Address = hyperlinks.Attributes["href"].Value,
                                      LinkTitle = hyperlinks.InnerText
                                  };

        // Print each link as "Link Title [url]".
        foreach (var linkDetail in listofHyperLinkTags)
            Console.WriteLine(linkDetail.LinkTitle + " [" + linkDetail.Address + "]");

        Console.Read();
    }
}

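If LINQ is not an option, use an XPath expression:

// SelectNodes returns null when nothing matches, so guard before iterating.
var anchorTags = htmlDoc.DocumentNode.SelectNodes("//a[@href]");

if (anchorTags != null)
{
    foreach (var tag in anchorTags)
    {
        // process each anchor tag here
    }
}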

If you want to modify the document, then use something like the following inside the loop (there may be better ways):

// Replace the anchor in place with a text node of the form "Link Title [url]".
var parentNode = tag.ParentNode;

HtmlNode node = htmlDoc.CreateTextNode(tag.InnerText + " [" + tag.Attributes["href"].Value + "]");

parentNode.ReplaceChild(node, tag);
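
For the plain-text email case in the question, the pieces above can be combined into a string-in, string-out helper. This is a rough sketch rather than a definitive implementation: the name LinksToPlainText is invented for illustration, it assumes the HtmlAgilityPack NuGet package is referenced, and it uses top-level statements (C# 9 or later).

```csharp
using System;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack" (assumed to be installed)

// Hypothetical helper: returns the text of an HTML fragment with each link
// rewritten as "Link Title [url]".
string LinksToPlainText(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null when no node matches the XPath expression.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors != null)
    {
        foreach (var a in anchors)
        {
            string text = a.InnerText + " [" + a.GetAttributeValue("href", "") + "]";
            // Swap the anchor for a plain text node at the same position.
            a.ParentNode.ReplaceChild(doc.CreateTextNode(text), a);
        }
    }

    // InnerText drops any remaining markup; DeEntitize decodes entities such as &amp;.
    return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
}

Console.WriteLine(LinksToPlainText(
    "<a href=\"http://thedomain.com/about\">Link Title</a> and "
    + "<A HREF=\"http://anotherlink.com\">Another Link</A>"));
```

Because the parser normalizes tag and attribute names, no separate case-insensitivity option should be needed for A HREF, and anchors without an href are skipped by the XPath filter.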
NoviceProgrammer
  • Please provide a code sample using a parser within C# to move the URL into a set of brackets following the link text. – Loren Feb 29 '12 at 18:21
  • Excellent! Great approach for custom parsing when regex would get too verbose. – Loren Mar 01 '12 at 00:49
  • I do love the HtmlAgilityPack, but I will still use regular expressions for simple, well-defined HTML parsing. – Darryl Mar 01 '12 at 16:10