Complete replacement
After some wrangling with this, I'm posting this solution. Use it or don't; it's mostly for my own current and future reference. Surprisingly, just the tag/attribute/value portion covers almost all use cases. Regex is still not recommended for parsing HTML, but if you do use one, it should be reasonably accurate, and this one is.
A C# code sample can be found here: http://ideone.com/TBxXm
It was debugged in VS2008 against the page source of CNN.com, and the working copy was then pasted to ideone for a permalink.
Here is the regex, lightly commented. It is written in free-spacing form, so it needs RegexOptions.IgnorePatternWhitespace (plus IgnoreCase and Singleline, as in the C# source):
<a
(?=\s)
# Optional preliminary att-vals (should prevent overruns)
(?:[^>"']|"[^"]*"|'[^']*')*?
# HREF, the attribute we're looking for
(?<=\s) href \s* =
# Quoted attr value (only)
# (?> \s* (['"]) (.*?) \1 )
# ---------------------------------------
# Or,
# Unquoted attr value (only)
# (?> (?!\s*['"]) \s* ([^\s>]*) (?=\s|>) )
# ---------------------------------------
# Or,
# Quoted/unquoted attr value (empty-unquoted value is allowed)
(?: (?> \s* (['"]) (?<URL>.*?) \1 )
| (?> (?!\s*['"]) \s* (?<URL>[^\s>]*) (?=\s|>) )
)
# Optional remaining att-vals
(?> (?:".*?"|'.*?'|[^>]?)+ )
# Not a self-closing tag (no slash right before the closing bracket)
(?<!/)
>
(?<TEXT>.*?)
</a \s*>
And here it is as it appears in the C# source:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
string input = @"
<a asdf = href= >BLANK</a>
<a href= a""'tz target=_self >ATZ</a>
<a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target=""_self"">Last missing U.S. soldier in Iraq ID'd</a>
<a id=""weatherLocBtn"" href=""javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);""><span>Go</span></a>
<a href=""javascript:CNN_handleOverlay('profile_signin_overlay')"">Log in</a>
<a no='href' here> NOT FOUND </a>
<a this href= is_ok > OK </a>
";
string regex = @"
<a
(?=\s)
(?:[^>""']|""[^""]*""|'[^']*')*?
(?<=\s) href \s* =
(?: (?> \s* (['""]) (?<URL>.*?) \1 )
| (?> (?!\s*['""]) \s* (?<URL>[^\s>]*) (?=\s|>) )
)
(?> (?:"".*?""|'.*?'|[^>]?)+ )
(?<!/)
>
(?<TEXT>.*?)
</a \s*>
";
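            // Rewrite every matched anchor as "TEXT [URL]".
            // IgnorePatternWhitespace permits the free-spacing layout of the pattern,
            // and Singleline lets '.' cross newlines inside the link text.
            string output = Regex.Replace(input, regex, "${TEXT} [${URL}]",
                RegexOptions.IgnoreCase |
                RegexOptions.Singleline |
                RegexOptions.IgnorePatternWhitespace);
            Console.WriteLine(input + "\n------------\n");
            Console.WriteLine(output);
        }
    }
}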
With output:
<a asdf = href= >BLANK</a>
<a href= a"'tz target=_self >ATZ</a>
<a href=/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1 target="_self">Last missing U.S. soldier in Iraq ID'd</a>
<a id="weatherLocBtn" href="javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);"><span>Go</span></a>
<a href="javascript:CNN_handleOverlay('profile_signin_overlay')">Log in</a>
<a no='href' here> NOT FOUND </a>
<a this href= is_ok > OK </a>
------------
BLANK []
ATZ [a"'tz]
Last missing U.S. soldier in Iraq ID'd [/2012/02/26/world/meast/iraq-missing-soldier-id/index.html?hpt=hp_bn1]
<span>Go</span> [javascript:MainLocalObj.Weather.checkInput('weather',document.localAllLookupForm.inputField.value);]
Log in [javascript:CNN_handleOverlay('profile_signin_overlay')]
<a no='href' here> NOT FOUND </a>
OK [is_ok]
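If you'd rather collect the link text and URL than rewrite the input, the same pattern can be run through Regex.Matches instead of Regex.Replace. The following is only a rough sketch under a few assumptions: the class name LinkExtractor, the field name AnchorRegex, and the two-link sample string are made up for illustration and are not part of the original sample.

```csharp
using System;
using System.Text.RegularExpressions;

public class LinkExtractor
{
    // Same pattern as in the C# source above, built once and reused.
    // (LinkExtractor and AnchorRegex are hypothetical names for this sketch.)
    public static readonly Regex AnchorRegex = new Regex(@"
        <a
        (?=\s)
        (?:[^>""']|""[^""]*""|'[^']*')*?
        (?<=\s) href \s* =
        (?: (?> \s* (['""]) (?<URL>.*?) \1 )
          | (?> (?!\s*['""]) \s* (?<URL>[^\s>]*) (?=\s|>) )
        )
        (?> (?:"".*?""|'.*?'|[^>]?)+ )
        (?<!/)
        >
        (?<TEXT>.*?)
        </a \s*>
        ",
        RegexOptions.IgnoreCase |
        RegexOptions.Singleline |
        RegexOptions.IgnorePatternWhitespace);

    public static void Main()
    {
        // Illustrative input: one quoted href and one unquoted href.
        string html = @"<a href=""/news/index.html"" target=""_self"">News</a> <a this href= is_ok > OK </a>";

        foreach (Match m in AnchorRegex.Matches(html))
        {
            // Both alternatives capture into the same named group, URL.
            string text = m.Groups["TEXT"].Value.Trim();
            string url = m.Groups["URL"].Value;
            Console.WriteLine("{0} -> {1}", text, url);
        }
    }
}
```

For that sample string this should print "News -> /news/index.html" followed by "OK -> is_ok". Note that .NET allows the same group name (URL) in both branches of the alternation, which is why one lookup covers the quoted and the unquoted case.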
Cheers!