I'm looking for libraries to parse HTML and extract links, forms, tags, etc.

LGPL or any other commercial-development-friendly license is preferable.

Do you have any experience with any of these libraries? Or could you recommend another similar library?

dr. evil

1 Answer

The HTML Agility Pack has examples of exactly this type of thing, and it uses XPath for familiar queries. For example (from the home page), finding all the links is simply:

foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a@href")) {
    //...
}

EDIT

As of 6/19/2012, the code above, as well as the only code sample shown on the HTML Agility Pack Examples page, no longer works. It just needs slight tweaking, as shown below.

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");

// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = Foo(att); // fix the link (Foo is a placeholder)
}
doc.Save("file.htm");
Marc Gravell

  • HTML Agility Pack is awesome, I also recommend it. – Matthew Olenik Mar 17 '09 at 08:13
  • Agreed. We used this in a production environment, where we parsed approximately 50,000 (X)HTML files/hr, for a couple years straight. Worked great. – core Mar 17 '09 at 09:06
  • Do you have any recommendations for a GPL project? HTML Agility Pack is Ms-Pl which is [incompatible with the GPL](https://www.gnu.org/licenses/license-list.html#ms-pl). – Cole Tobin Dec 22 '14 at 00:59
  • @Cole bleugh; damn I hate that (GPL) license! But: the good news is that GPL victims are entirely "free" to not use that non-GPL library and to go write another under a compatible license :) – Marc Gravell Dec 22 '14 at 16:53