653

How do I use the HTML Agility Pack?

My XHTML document is not completely valid. That's why I wanted to use it. How do I use it in my project? My project is in C#.

carla
  • 84
    This question was very helpful to me. – BigJoe714 May 21 '10 at 20:34
  • 26
    Side Note: with a Visual Studio that handles NuGet, you can now right-click "References" and choose "Manage NuGet Packages...", search for "HtmlAgilityPack" and click "Install". Then get right into playing with the code with a using/Import statement. – patridge Jun 28 '11 at 18:12
  • Regarding the above comment by @patridge: I found that I needed to remove and then re-add my reference to the HtmlAgilityPack when first fetching the project from svn via ankhsvn. – Andrew Coonce Jan 16 '13 at 21:08
  • @AndrewCoonce Sounds like the "restore missing packages" option on nuget might be of help with that issue. – Cornelius Sep 27 '13 at 14:37
  • 14
    Anyone looking into HTMLAgilityPack should consider CsQuery, it's a much newer library with a much more modern interface from my experience. For example, the whole code from the first answer can be summed up in CsQuery as `var body = CQ.CreateFromFile(filePath)["body"]`. – Benjamin Gruenbaum Jan 01 '14 at 10:41
  • 2
    @BenjaminGruenbaum: Thumbs up for your CsQuery suggestion - set up in minutes, very easy to use. – Victor Zakharov May 13 '14 at 20:06
  • @BenjaminGruenbaum Don't use CsQuery for anything important. There are many large bugs that will return incorrect data. – tic Sep 17 '15 at 20:59
  • @tic I have over a million lines of code using CsQuery, with over 100 scrapers, and it has been pretty flawless so far. Are you sure you understand how contexts work in CsQuery (the fact that the context is "sticky", sort of unlike jQuery)? If you find bugs, please report them. – Benjamin Gruenbaum Sep 18 '15 at 14:39
  • @BenjaminGruenbaum I have already reported one of them and there remain 2 others open that are quite large bugs for my projects. View the bug tracker on the github page. These bugs have remained unfixed for a long time – tic Sep 19 '15 at 14:48

7 Answers

366

First, install the HtmlAgilityPack NuGet package into your project.

Then, as an example:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

// filePath is a path to a file containing the html
htmlDoc.Load(filePath);

// To load from a string instead, use htmlDoc.LoadHtml(htmlString) (formerly htmlDoc.LoadXML(xmlString))

// ParseErrors holds any errors from the Load call. Recent versions expose it as
// IEnumerable<HtmlParseError>, so Count() is the LINQ extension method (needs using System.Linq;)
if (htmlDoc.ParseErrors != null && htmlDoc.ParseErrors.Count() > 0)
{
    // Handle any parse errors as required

}
else
{

    if (htmlDoc.DocumentNode != null)
    {
        HtmlAgilityPack.HtmlNode bodyNode = htmlDoc.DocumentNode.SelectSingleNode("//body");

        if (bodyNode != null)
        {
            // Do something with bodyNode
        }
    }
}

(NB: This code is an example only and not necessarily the best/only approach. Do not use it blindly in your own application.)

The HtmlDocument.Load() method also accepts a stream, which is very useful for integrating with other stream-oriented classes in the .NET Framework. HtmlEntity.DeEntitize() is another useful method, for processing HTML entities correctly. (thanks Matthew)
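As a rough illustration of the two methods just mentioned, here is a minimal sketch that parses HTML from a stream and decodes entities in the extracted text. The class and method names (StreamLoadSketch, FirstParagraphText) and the sample markup are made up for this example; a MemoryStream stands in for whatever stream you actually have, such as the one returned by WebResponse.GetResponseStream.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class StreamLoadSketch
{
    // Parses HTML from any readable stream and returns the text of the
    // first <p> element with HTML entities decoded.
    // (Hypothetical helper, not part of HtmlAgilityPack.)
    public static string FirstParagraphText(Stream htmlStream)
    {
        var doc = new HtmlDocument();
        doc.Load(htmlStream, Encoding.UTF8);

        HtmlNode paragraph = doc.DocumentNode.SelectSingleNode("//p");
        if (paragraph == null)
        {
            return string.Empty;
        }

        // InnerText may still contain entities such as &amp;
        // DeEntitize turns them back into the characters they stand for.
        return HtmlEntity.DeEntitize(paragraph.InnerText);
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes(
            "<html><body><p>Fish &amp; chips</p></body></html>");

        using (var stream = new MemoryStream(bytes))
        {
            Console.WriteLine(FirstParagraphText(stream));
        }
    }
}
```

Passing the encoding explicitly is a choice made for the sketch; Load also has an overload that takes only the stream.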

HtmlDocument and HtmlNode are the classes you'll use most. Similar to an XML parser, HtmlNode provides the SelectSingleNode and SelectNodes methods, which accept XPath expressions.
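A small sketch of SelectNodes in use, assuming you want the href of every link in a piece of markup. XPathSketch and LinkTargets are hypothetical names for this example. The point worth noticing is the null check: by default SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns the href value of every <a> element that has an href attribute.
    // (Hypothetical helper, not part of HtmlAgilityPack.)
    public static List<string> LinkTargets(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var hrefs = new List<string>();

        // By default SelectNodes returns null, not an empty collection,
        // when the XPath expression matches nothing.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            // The second argument is the fallback used if the attribute is missing.
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        string html = "<p><a href='/first'>one</a> <a>no link</a> <a href='/second'>two</a></p>";
        foreach (string href in LinkTargets(html))
        {
            Console.WriteLine(href);
        }
    }
}
```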

Pay attention to the boolean HtmlDocument.Option* properties (OptionFixNestedTags, for example). These control how the Load and LoadHtml methods will process your HTML/XHTML.
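To make the option properties more concrete, here is a hedged sketch that sets a few of them before loading sloppy markup and then writes the document back out. OptionsSketch and SaveAsXml are invented names, and the particular combination of options is only an assumption about what suits a not-quite-valid XHTML document. Check each option against the reference for the version you have installed, and do not assume the output is well-formed XML for every input.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class OptionsSketch
{
    // Loads possibly malformed markup with a few options switched on,
    // then serializes the parsed document back to a string.
    // (Hypothetical helper, not part of HtmlAgilityPack.)
    public static string SaveAsXml(string html)
    {
        var doc = new HtmlDocument();

        // Options are set before loading so that they apply to the parse.
        doc.OptionFixNestedTags = true;   // try to repair badly nested tags
        doc.OptionAutoCloseOnEnd = true;  // close unclosed nodes at the end of the document
        doc.OptionOutputAsXml = true;     // ask for XML-style output when saving

        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Unclosed <li> and <br> are typical of markup that is not valid XHTML.
        string sloppy = "<ul><li>one<li>two<br></ul>";
        Console.WriteLine(SaveAsXml(sloppy));
    }
}
```

Comparing the output with and without each option on a sample of your own documents is probably the quickest way to see what they do.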

There is also a compiled help file called HtmlAgilityPack.chm that has a complete reference for each of the objects. This is normally in the base folder of the solution.

DaveShaw
Ash
  • 11
    Also note that Load accepts a Stream parameter, which is convenient in many situations. I used it for a HTTP stream (WebResponse.GetResponseStream). Another good method to be aware of is HtmlEntity.DeEntitize (part of HTML Agility Pack). This is needed to process entities manually in some cases. – Matthew Flaschen May 11 '09 at 07:34
  • 1
    note: in the latest beta of Html Agility Pack (1.4.0 Beta 2 released Oct 3 2009) the help file has been moved out into a separate download because of dependencies on Sandcastle, DocProject and the Visual Studio 2008 SDK. – rtpHarry Apr 06 '10 at 23:02
  • `SelectSingleNode() ` seems to have been removed a while ago – Chris S Jul 16 '10 at 08:36
  • 3
    No, SelectSingleNode and SelectNodes are definitely still there. I find it a little interesting that it should be htmlDoc.ParseErrors.Count(), not .Count – Mike Blandford Feb 14 '11 at 02:04
  • It is a property. Properties do not require () – Alireza Noori May 07 '11 at 11:21
  • 1
    @MikeBlandford Partially, yes. It seems to have been removed from (or never existed in) the PCL version of HtmlAgilityPack. http://www.nuget.org/packages/HtmlAgilityPack-PCL/ – Joon Hong Oct 23 '13 at 08:01
  • Replace htmlDoc.ParseErrors.Count() > 0 with htmlDoc.ParseErrors.Any() – mathewsun Jan 23 '16 at 19:36
167

I don't know if this will be of any help to you, but I have written a couple of articles which introduce the basics.

The next article is 95% complete; I just have to write up explanations of the last few parts of the code I have written. If you are interested, I will try to remember to post here when I publish it.

Peter Mortensen
rtpHarry
  • 17
    Finally finished that article two years later :) [A straightforward method to detecting RSS and Atom feeds in websites with HtmlAgilityPack](http://runtingsproper.blogspot.co.uk/2012/07/a-straightforward-method-to-detecting.html) – rtpHarry Jul 21 '12 at 09:18
  • 3
    _Code Project_ recently published a very good article on HtmlAgilityPack. You can read it [here](http://www.codeproject.com/Articles/691119/Html-Agility-Pack-Massive-information-extraction-f) – Victor Sigler Feb 19 '14 at 19:15
66

HtmlAgilityPack uses XPath syntax, and though many argue that it is poorly documented, I had no trouble using it with help from this XPath documentation: https://www.w3schools.com/xml/xpath_syntax.asp

To parse

<h2>
  <a href="">Jack</a>
</h2>
<ul>
  <li class="tel">
    <a href="">81 75 53 60</a>
  </li>
</ul>
<h2>
  <a href="">Roy</a>
</h2>
<ul>
  <li class="tel">
    <a href="">44 52 16 87</a>
  </li>
</ul>

I did this:

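// names and phones were not declared in the original snippet; plain string lists are assumed
// (needs using System.Collections.Generic; and using HtmlAgilityPack;)
var names = new List<string>();
var phones = new List<string>();

string url = "http://website.com";
var webGet = new HtmlWeb();
var doc = webGet.Load(url);

// By default SelectNodes returns null (not an empty collection) when nothing matches
var nameNodes = doc.DocumentNode.SelectNodes("//h2//a");
var phoneNodes = doc.DocumentNode.SelectNodes("//li[@class='tel']//a");

if (nameNodes != null)
{
  foreach (HtmlNode node in nameNodes)
    names.Add(node.ChildNodes[0].InnerHtml);
}
if (phoneNodes != null)
{
  foreach (HtmlNode node in phoneNodes)
    phones.Add(node.ChildNodes[0].InnerHtml);
}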
Brendan Gooden
Kent Munthe Caspersen
6

The main HtmlAgilityPack-related code is as follows:

using System;
using System.Net;
using System.Web;
using System.Web.Services;
using System.Web.Script.Services;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

namespace GetMetaData
{
    /// <summary>
    /// Summary description for MetaDataWebService
    /// </summary>
    [WebService(Namespace = "http://tempuri.org/")]
    [WebServiceBinding(ConformsTo = WsiProfiles.BasicProfile1_1)]
    [System.ComponentModel.ToolboxItem(false)]
    // To allow this Web Service to be called from script, using ASP.NET AJAX, uncomment the following line.
    [System.Web.Script.Services.ScriptService]
    public class MetaDataWebService: System.Web.Services.WebService
    {
        [WebMethod]
        [ScriptMethod(UseHttpGet = false)]
        public MetaData GetMetaData(string url)
        {
            MetaData objMetaData = new MetaData();

            //Get Title
            WebClient client = new WebClient();
            string sourceUrl = client.DownloadString(url);

            // The @ must sit directly in front of the opening quote of a verbatim string
            objMetaData.PageTitle = Regex.Match(sourceUrl,
                @"\<title\b[^>]*\>\s*(?<Title>[\s\S]*?)\</title\>", RegexOptions.IgnoreCase).Groups["Title"].Value;

            //Method to get Meta Tags
            objMetaData.MetaDescription = GetMetaDescription(url);
            return objMetaData;
        }

        private string GetMetaDescription(string url)
        {
            string description = string.Empty;

            //Get Meta Tags
            var webGet = new HtmlWeb();
            var document = webGet.Load(url);
            var metaTags = document.DocumentNode.SelectNodes("//meta");

            if (metaTags != null)
            {
                foreach(var tag in metaTags)
                {
                    if (tag.Attributes["name"] != null && tag.Attributes["content"] != null && tag.Attributes["name"].Value.ToLower() == "description")
                    {
                        description = tag.Attributes["content"].Value;
                    }
                }
            } 
            else
            {
                description = string.Empty;
            }
            return description;
        }
    }
}
captainsac
5
    public string HtmlAgi(string url, string key)
    {

        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);
        HtmlNode ourNode = doc.DocumentNode.SelectSingleNode(string.Format("//meta[@name='{0}']", key));

        if (ourNode != null)
        {
            return ourNode.GetAttributeValue("content", "");
        }
        else
        {
            return "not found";
        }

    }
ibrahim ozboluk
0

Getting Started - HTML Agility Pack

// From File
var doc = new HtmlDocument();
doc.Load(filePath);

// From String
var doc = new HtmlDocument();
doc.LoadHtml(html);

// From Web
var url = "http://html-agility-pack.net/";
var web = new HtmlWeb();
var doc = web.Load(url);
Meysam
0

Try this:

string htmlBody = ParseHmlBody(dtViewDetails.Rows[0]["Body"].ToString());

private string ParseHmlBody(string html)
{
    string body = string.Empty;
    try
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);
        var htmlBody = htmlDoc.DocumentNode.SelectSingleNode("//body");

        // SelectSingleNode returns null when the markup has no <body> element
        if (htmlBody != null)
        {
            body = htmlBody.OuterHtml;
        }
    }
    catch (Exception ex)
    {
        dalPendingOrders.LogMessage("Error in ParseHmlBody: " + ex.Message);
    }
    return body;
}
PK-1825