1

I have a string variable that contains the entire HTML of a web page. The page contains links to other websites, and I would like to build a list of all the hrefs (web-crawler style). What is the best way to do this? Would an extension method help? What about using Regex?

Thanks in advance.

Ananth

5 Answers

3

Use a DOM parser such as the HTML Agility Pack to parse your document and find all links.

There's a good question on SO about how to use HTML Agility Pack available here. Here's a simple example to get you started:

string html = "your HTML here";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();

doc.LoadHtml(html);

var links = doc.DocumentNode.DescendantNodes()
   .Where(n => n.Name == "a" && n.Attributes.Contains("href"))
   .Select(n => n.Attributes["href"].Value);
Donut
  • @Donut: Thanks for enlightening me about HTML Agility Pack. I had never used it before. I am now exploring it. – Ananth Jun 18 '11 at 05:08
1

I think you'll find this answers your question to a T

http://msdn.microsoft.com/en-us/library/t9e807fx.aspx

:)

The Evil Greebo
1

I would go with Regex.

        // Capture the value of every quoted href attribute
        Regex exp = new Regex(
            @"href\s*=\s*[""']([^""']*)[""']",
            RegexOptions.IgnoreCase);
        string InputText = "your HTML here"; // supply the page's HTML
        MatchCollection MatchList = exp.Matches(InputText);
therealmitchconnors
1

Try this Regex (should work):

var matches = Regex.Matches (html, @"href=""(.+?)""");

You can go through the matches and extract the captured URL.
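
As a minimal sketch of that last step, the loop below collects capture group 1 from each match into a list. The class name HrefExtractor and the method name ExtractHrefs are hypothetical, chosen only for illustration; the pattern is the one from this answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Hypothetical helper: returns the text captured by group 1
    // (the part between the double quotes) for every href match.
    public static List<string> ExtractHrefs(string html)
    {
        var urls = new List<string>();

        foreach (Match m in Regex.Matches(html, @"href=""(.+?)"""))
        {
            // Groups[0] is the whole match; Groups[1] is the captured URL.
            urls.Add(m.Groups[1].Value);
        }

        return urls;
    }
}
```

Note that this pattern only matches href values wrapped in double quotes; single-quoted or unquoted attributes would be skipped, which is one reason the DOM-parser answers may be more robust on real-world pages.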

Tim Rogers
1

Have you looked into using the HTML Agility Pack? http://htmlagilitypack.codeplex.com/

With it you can simply use XPath to get all of the links on the page and put them into a list.

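private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links == null)
        return hrefTags;

    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        hrefTags.Add(att.Value);
    }

    return hrefTags;
}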

Taken from another post here - Get all links on html page?

EvanGWatkins