I'm looking for libraries to parse HTML and extract links, forms, tags, etc.

LGPL or any other commercial-development-friendly license is preferable.

Do you have any experience with any of these libraries? Or could you recommend another similar library?

dr. evil

1 Answer

The HTML Agility Pack has examples of exactly this type of thing, and it uses XPath for familiar queries. For example (from the home page), finding all the links is simply:

foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a@href")) {
    //...
}

EDIT

As of 6/19/2012, the code above, as well as the only code sample shown on the HTML Agility Pack Examples page, no longer works. It just needs slight tweaking, as shown below.

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");

// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    att.Value = Foo(att); // fix the link (Foo is a placeholder)
}
doc.Save("file.htm");
Marc Gravell

  • HTML Agility Pack is awesome, I also recommend it. – Matthew Olenik Mar 17 '09 at 08:13
  • Agreed. We used this in a production environment, where we parsed approximately 50,000 (X)HTML files/hr, for a couple years straight. Worked great. – core Mar 17 '09 at 09:06
  • Do you have any recommendations for a GPL project? HTML Agility Pack is Ms-Pl which is [incompatible with the GPL](https://www.gnu.org/licenses/license-list.html#ms-pl). – Cole Tobin Dec 22 '14 at 00:59
  • @Cole bleugh; damn I hate that (GPL) license! But: the good news is that GPL victims are entirely "free" to not use that non-GPL library and to go write another under a compatible license :) – Marc Gravell Dec 22 '14 at 16:53