2

I have written the following code to parse hyperlinks from a given page.

    WebClient web = new WebClient();
    string html = web.DownloadString("http://www.msdn.com");
    string[] separators = new string[] { "<a ", ">" };
    List<string> hyperlinks= html.Split(separators, StringSplitOptions.None).Select(s =>
    {
        if (s.Contains("href"))
            return s;
        else
            return null;
    }).ToList();

The string split still has to be tweaked to return URLs perfectly. My question: is there some data structure, something along the lines of XmlReader, that can read HTML strings efficiently?

Any suggestions for improving the above code would also be helpful.

Thanks for your time.

Abhijeet

4 Answers

2

Try HtmlAgilityPack:

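        // Requires the HtmlAgilityPack package and `using HtmlAgilityPack;`.
        HtmlWeb hw = new HtmlWeb();
        HtmlDocument doc = hw.Load("http://www.msdn.com");
        // SelectNodes can return null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", null));
            }
        }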

This prints the href of every link on the page.

If you want to store the links in a list:

        var linkList = doc.DocumentNode.SelectNodes("//a[@href]")
            .Select(i => i.GetAttributeValue("href", null)).ToList();

Thousand
1

You should be using a parser. The most widely used one is HtmlAgilityPack. Using that, you can interact with the HTML as a DOM.

Kirk Woll
1

Assuming you're dealing with well-formed XHTML, you could simply treat the text as an XML document. The framework is loaded with features to do exactly what you're asking.

http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.aspx

See also: Does .NET framework offer methods to parse an HTML string?
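
For what it's worth, here is a minimal sketch of the XML approach using `XDocument` from `System.Xml.Linq`. The class name `XhtmlLinks` and the method `Extract` are made up for illustration. It assumes the input really is well-formed XML with no undeclared entities (such as `&nbsp;`); most pages in the wild are not, and in that case `XDocument.Parse` throws an `XmlException`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the framework.
public static class XhtmlLinks
{
    // Returns the href value of every <a> element that has one.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static List<string> Extract(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants()
                  // Compare local names so the XHTML namespace does not matter.
                  .Where(e => e.Name.LocalName == "a")
                  // The cast yields null when the attribute is missing.
                  .Select(e => (string)e.Attribute("href"))
                  .Where(href => href != null)
                  .ToList();
    }
}
```

Calling `XhtmlLinks.Extract(html)` on a downloaded string gives you the hrefs in document order; anchors without an `href` attribute are skipped.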

Doug
0

Refactored:

        var html = new WebClient().DownloadString("http://www.msdn.com");
        var separators = new[] { "<a ", ">" };
        // Where drops fragments without "href" instead of leaving nulls in the list.
        var hyperlinks = html.Split(separators, StringSplitOptions.None)
            .Where(s => s.Contains("href")).ToList();

Rahul Rumalla