-1

Possible Duplicate:
RegEx to return 'href' attribute of 'link' tags only?

This is my string. I want to extract the URL from the href attribute (href="pull this out") and the text between the tags using C# Regex, but I'm not sure how to do it.

<a href="http://msdn.microsoft.com/en-us/library/Aa538627.aspx" onclick="trackClick(this, '117', 'http\x3a\x2f\x2fmsdn.microsoft.com\x2fen-us\x2flibrary\x2fAa538627.aspx', '15');">ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...</a>
Fumiko
  • Don't use regex to parse HTML. – hsz May 06 '11 at 17:21
  • There are dozens of regex questions already and the answer to yours is a simple search away. You should be able to easily adapt this answer to your problem: http://stackoverflow.com/questions/268338/regex-to-return-href-attribute-of-link-tags-only – Timothy Strimple May 06 '11 at 17:23
  • Look up the HTML Agility Pack. And try searching SO for "HTML Parsing C#"; you'll get lots of questions with this exact answer. – AllenG May 06 '11 at 17:23

2 Answers

4

Don't use regex to parse HTML (as @hsz mentioned). To see why, read: RegEx match open tags except XHTML self-contained tags. Instead, you could use an HTML parser such as HtmlAgilityPack:

var html = @"<a href=""http://msdn.microsoft.com/en-us/library/Aa538627.aspx"" onclick=""trackClick(this, '117', 'http\x3a\x2f\x2fmsdn.microsoft.com\x2fen-us\x2flibrary\x2fAa538627.aspx', '15');"">ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...</a>";
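// HtmlDocument comes from the HtmlAgilityPack library (using HtmlAgilityPack;).
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

// Select the first <a> element in the document.
var link = document.DocumentNode.SelectSingleNode("//a");
if (link != null)
{
    // GetAttributeValue returns the fallback instead of throwing when href is missing.
    var href = link.GetAttributeValue("href", string.Empty);
    var innerText = link.InnerText;
}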

Now href contains http://msdn.microsoft.com/en-us/library/Aa538627.aspx, and innerText (the text between the tags) contains ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ....

Isn't that easier than regex?
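If the input may contain more than one link, the same idea extends to a loop. The following is only a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the LinkExtractor class and ExtractLinks method are hypothetical names chosen for illustration and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

// Hypothetical helper for illustration only; not part of HtmlAgilityPack.
static class LinkExtractor
{
    // Returns (href, text) pairs for every <a> element that has an href attribute.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var results = new List<KeyValuePair<string, string>>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return results;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The fallback value is used if the attribute is unexpectedly absent.
            string href = anchor.GetAttributeValue("href", string.Empty);

            // InnerText keeps entities such as &amp; so decode them before use.
            string text = HtmlEntity.DeEntitize(anchor.InnerText).Trim();

            results.Add(new KeyValuePair<string, string>(href, text));
        }

        return results;
    }
}
```

Calling LinkExtractor.ExtractLinks(html) with the string from the question should give a single pair holding the MSDN URL and the link text. The XPath filter //a[@href] skips anchors without an href, so named anchors are not returned with empty URLs.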

Oleks
2

This shows how to do what you are looking for: C# Scraping HTML Links

Here is the code example from that page:

using System.Collections.Generic;
using System.Text.RegularExpressions;

public struct LinkItem
{
    public string Href;
    public string Text;

    public override string ToString()
    {
        return Href + "\n\t" + Text;
    }
}

static class LinkFinder
{
    public static List<LinkItem> Find(string file)
    {
        List<LinkItem> list = new List<LinkItem>();

        // 1.
        // Find all matches in file.
        MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
            RegexOptions.Singleline);

        // 2.
        // Loop over each match.
        foreach (Match m in m1)
        {
            string value = m.Groups[1].Value;
            LinkItem i = new LinkItem();

            // 3.
            // Get href attribute.
            Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
                RegexOptions.Singleline);
            if (m2.Success)
            {
                i.Href = m2.Groups[1].Value;
            }

            // 4.
            // Remove inner tags from text.
            string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
                RegexOptions.Singleline);
            i.Text = t;

            list.Add(i);
        }
        return list;
    }
}
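For a single, well-formed link like the one in the question, the href and the text can also be captured in one pass with named groups. This is a rough sketch, not the code from the linked page: the pattern and the AnchorPattern name are assumptions made for illustration. It expects a double-quoted href and no '>' character inside any attribute value, so it will miss or mangle markup that breaks those assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

static class Program
{
    // Hypothetical pattern for illustration. Assumes a double-quoted href
    // and no '>' inside attribute values; not a general HTML parser.
    static readonly Regex AnchorPattern = new Regex(
        @"<a\s[^>]*?href\s*=\s*""(?<href>[^""]*)""[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    static void Main()
    {
        // The sample string from the question.
        string html = @"<a href=""http://msdn.microsoft.com/en-us/library/Aa538627.aspx"" onclick=""trackClick(this, '117', 'http\x3a\x2f\x2fmsdn.microsoft.com\x2fen-us\x2flibrary\x2fAa538627.aspx', '15');"">ToolStripItemOwnerCollectionUIAdapter.GetInsertingIndex Method ...</a>";

        Match match = AnchorPattern.Match(html);
        if (match.Success)
        {
            // Named groups avoid relying on numeric group positions.
            Console.WriteLine(match.Groups["href"].Value);
            Console.WriteLine(match.Groups["text"].Value);
        }
        else
        {
            Console.WriteLine("No link found.");
        }
    }
}
```

For the question's input this should print the URL followed by the link text. Requiring whitespace after `<a` keeps the pattern from matching other tags that merely start with "a", such as `<abbr>`, which the looser `<a.*?>` pattern would accept.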
Tim Hobbs