I have a string that contains HTML. I want to get all the href values from the hyperlinks in it using C#.
Target string:
<a href="~/abc/cde" rel="new">Link1</a>
<a href="~/abc/ghq">Link2</a>

I want to get the values "~/abc/cde" and "~/abc/ghq".

coure2011
  • [obligatory reference](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) :) – balpha Apr 12 '10 at 16:54
  • @balpha: What? That absolutely does not apply here. You can use regex to get the href of an open tag and not even bother with closing tags. – Platinum Azure Apr 12 '10 at 17:01
  • @Platinum: http://en.wikipedia.org/wiki/Emoticon – balpha Apr 13 '10 at 05:19
  • @balpha: Well, I'm glad you have a sense of humor, but given how it has also appeared in EVERY answer below, you can understand why I might think people just have this knee-jerk "omg never use regex to parse HTML" response, emoticon or no. – Platinum Azure Apr 13 '10 at 18:36
  • @Platinum Azure: No harm -- I just love to mention that answer, because if you've read it once, it will stick in your head and haunt you whenever you start markup parsing with regexes. That doesn't mean it's always wrong, but having that answer in your head makes you at least think about it. I sometimes analyze HTML without a real parser, too, but I usually put a comment `# the center cannot hold` before it :) – balpha Apr 14 '10 at 15:25

3 Answers

Use the HTML Agility Pack for parsing HTML. Right on their examples page they have an example of parsing some HTML for the href values:

 // "doc" is an HtmlDocument that already holds the parsed HTML.
 foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
 {
    HtmlAttribute att = link.Attributes["href"];

    // Do stuff with att.Value
 }
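
For completeness, here is a minimal sketch of the whole round trip, from an HTML string to a list of href values. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `HrefExtractor` and the method `GetHrefs` are made up for illustration and are not part of the library. Note that `SelectNodes` returns null by default when nothing matches, so the sketch checks for that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HrefExtractor // hypothetical helper class
{
    // Returns the href value of every <a> element that has an href attribute.
    public static List<string> GetHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the string instead of loading a file or URL

        var hrefs = new List<string>();
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return hrefs; // no matching anchors in the document
        }

        foreach (HtmlNode link in links)
        {
            // The second argument is the fallback used if the attribute is missing.
            hrefs.Add(link.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n<a href=\"~/abc/ghq\">Link2</a>";
        foreach (string href in GetHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Because a real parser is doing the work, anchors hidden inside HTML comments are not reported, which is the main case where the regex approaches tend to go wrong.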
womp

Using a regex to parse HTML is not advisable (think of text in comments etc.).

That said, the following regex should do the trick, and also gives you the link HTML in the tag if desired:

Regex regex = new Regex(@"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>", RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture);
for (Match match = regex.Match(inputHtml); match.Success; match=match.NextMatch()) {
  Console.WriteLine(match.Groups["href"]);
}
Lucero
  • That's exactly what I was looking for. How do the groups work? – coure2011 Apr 12 '10 at 18:09
  • I am trying the same thing for img src but it's not working, any idea? `Regex srcs = new Regex(@"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</img\s*\>).)*)\</img\s*\>", RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);` – coure2011 Apr 12 '10 at 18:58
  • The `img` tag is an empty tag, so you have no contents. Try this: `\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>` – Lucero Apr 12 '10 at 19:26
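
On the question of how the groups work, a short sketch may help. `(?<quote>['""])` captures the opening quote character under the name `quote`, and `\k<quote>` is a backreference that only matches that same character again, so a value opened with `'` must be closed with `'`. `(?<href>...)` stores the attribute value under the name `href`, which is what `match.Groups["href"]` reads back. With `RegexOptions.ExplicitCapture`, the unnamed parentheses group without capturing. The class and method names below are hypothetical, and the group name `src` in the image pattern is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo // hypothetical helper class
{
    // The anchor pattern from the answer above, unchanged.
    private static readonly Regex AnchorRegex = new Regex(
        @"\<a\s[^\<\>]*?href=(?<quote>['""])(?<href>((?!\k<quote>).)*)\k<quote>[^\>]*\>(?<linkHtml>((?!\</a\s*\>).)*)\</a\s*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // The img variant: no closing tag, so no content group (group name "src" is assumed).
    private static readonly Regex ImageRegex = new Regex(
        @"\<img\s[^\<\>]*?src=(?<quote>['""])(?<src>((?!\k<quote>).)*)\k<quote>[^\>]*\>",
        RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture);

    // Collects the value of one named group from every match of the regex.
    private static List<string> Collect(Regex regex, string input, string groupName)
    {
        var values = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            values.Add(match.Groups[groupName].Value);
        }
        return values;
    }

    public static List<string> ExtractHrefs(string html)
    {
        return Collect(AnchorRegex, html, "href");
    }

    public static List<string> ExtractImageSources(string html)
    {
        return Collect(ImageRegex, html, "src");
    }

    public static void Main()
    {
        string html = "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n<a href=\"~/abc/ghq\">Link2</a>";
        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href); // prints ~/abc/cde, then ~/abc/ghq
        }
    }
}
```

The same caveat as in the answer applies: this works on the sample input, but it will also match anchors inside comments or scripts.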

Here is a snippet of the regex (use the RegexOptions.IgnorePatternWhitespace option so that the layout and the # comments are ignored):

(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
# -- Extract Attributes Key Value Pairs  --

((?:\s+)             # One to many spaces start the attribute
 (?<Key>[^=]+)       # Name/key of the attribute
 (?:=)               # Equals sign needs to be matched, but not captured.

(?([\x22\x27])              # If quotes are found
  (?:[\x22\x27])
  (?<Value>[^\x22\x27]+)    # Place the value into named Capture
  (?:[\x22\x27])
 |                          # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                  # -- One to many attributes found!

This will give you every tag and you can filter out what is needed and target the attribute you want.
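
As a rough sketch of that filtering step, the code below runs the pattern above and keeps only the `href` values of `a` tags. It relies on the fact that each attribute adds one capture to `Key` and one to `Value`, so the two capture lists can be walked in parallel. The class name `TagAttributeDemo` and the method `GetAttributeValues` are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagAttributeDemo // hypothetical helper class
{
    // The pattern from the answer; it needs RegexOptions.IgnorePatternWhitespace.
    private const string TagPattern = @"
(?:<)(?<Tag>[^\s/>]+)       # Extract the tag name.
(?![/>])                    # Stop if /> is found
((?:\s+)                    # One to many spaces start the attribute
 (?<Key>[^=]+)              # Name/key of the attribute
 (?:=)                      # Equals sign needs to be matched, but not captured.
 (?([\x22\x27])             # If quotes are found
   (?:[\x22\x27])
   (?<Value>[^\x22\x27]+)   # Place the value into named Capture
   (?:[\x22\x27])
  |                         # Else no quotes
   (?<Value>[^\s/>]*)       # Place the value into named Capture
 )
)+                          # One to many attributes found
";

    // Returns the values of attributeName on every tagName element found by the pattern.
    public static List<string> GetAttributeValues(string html, string tagName, string attributeName)
    {
        var values = new List<string>();
        foreach (Match match in Regex.Matches(html, TagPattern, RegexOptions.IgnorePatternWhitespace))
        {
            if (!string.Equals(match.Groups["Tag"].Value, tagName, StringComparison.OrdinalIgnoreCase))
            {
                continue; // not the tag we are interested in
            }

            // Key and Value each hold one capture per attribute, in document order.
            CaptureCollection keys = match.Groups["Key"].Captures;
            CaptureCollection vals = match.Groups["Value"].Captures;
            for (int i = 0; i < keys.Count && i < vals.Count; i++)
            {
                if (string.Equals(keys[i].Value.Trim(), attributeName, StringComparison.OrdinalIgnoreCase))
                {
                    values.Add(vals[i].Value);
                }
            }
        }
        return values;
    }

    public static void Main()
    {
        string html = "<a href=\"~/abc/cde\" rel=\"new\">Link1</a>\n<a href=\"~/abc/ghq\">Link2</a>";
        foreach (string href in GetAttributeValues(html, "a", "href"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Calling `GetAttributeValues(html, "a", "rel")` on the same input would return only the `rel` value of the first link, since the second link has no such attribute.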

I've written more about this in my blog (C# Regex Linq: Extract an Html Node with Attributes of Varying Types).

casperOne
ΩmegaMan