
I'm trying to get the content of a page using the HTML Agility Pack. Here's a sample of the HTML I'm trying to parse:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

I'm trying to parse the HTML above using the following code:

HtmlAgilityPack.HtmlWeb nytArticlePage = new HtmlAgilityPack.HtmlWeb();

System.Diagnostics.Debug.WriteLine(articleUrl);

// HtmlWeb.Load already returns a populated HtmlDocument,
// so there is no need to construct one beforehand.
HtmlAgilityPack.HtmlDocument nytArticleDoc = nytArticlePage.Load(articleUrl);

// Note: SelectNodes returns null (not an empty collection) when nothing matches.
var articleBodyScope =
    nytArticleDoc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");

EDIT:

But it seems that articleBodyScope is null, because the following:

if (articleBodyScope != null)
{
    System.Diagnostics.Debug.WriteLine("CONTENT NOT NULL");
    foreach (var node in articleBodyScope)
    {
        articleBodyText += node.InnerText;
    }
}

does not print "CONTENT NOT NULL", and articleBodyText stays empty. If anyone could point me to the solution I'd be grateful. Thanks in advance!

Kjartan
  • `it seems like articleBodyScope is empty` but it is not. – EZI Dec 08 '13 at 23:07
  • @QtX, if it weren't, I wouldn't have had to post this message :) I've edited the post. –  Dec 08 '13 at 23:10
  • Itamar, if it were empty I wouldn't comment so :) I took your HTML and XPath and loaded them into an HtmlDocument. I got 3 items. – EZI Dec 08 '13 at 23:12
  • @QtX, first of all, thanks for the answer. When I run the program, I'm getting an error from Visual Studio saying I have to check whether this object is null. Therefore, I assume it's empty. –  Dec 09 '13 at 08:16
  • @QtX, I've edited the post once again; hope now it will be clear. –  Dec 09 '13 at 08:59
  • Are the contents there when you view the source? They might be loaded through Ajax or some other method, which would cause them not to be there for the Html Agility Pack to load. Is there a public URI we can check? Your XPath is correct, so there must be something else going on here. – jessehouwing Dec 09 '13 at 10:15
  • @jessehouwing, here's an example of a page I'm trying to parse: http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=0 As far as I can see, there is no use of Ajax to load the page... –  Dec 09 '13 at 18:20
  • It looks like the New York Times is adding a bunch of non-valid HTML tags to the tag soup. It might be that the HtmlAgilityPack is ignoring those on purpose... – jessehouwing Dec 09 '13 at 18:45
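EZI's check in the comments above can be reproduced offline. This is a sketch only: the three short paragraphs stand in for the question's sample HTML, and it shows that the XPath itself matches fine when the markup is actually present, so the problem must lie in what the live page returns.

```csharp
using System;
using HtmlAgilityPack;

// Stand-in for the question's sample HTML: three paragraphs with
// itemprop="articleBody", loaded directly with no network involved.
string sample =
    @"<p itemprop=""articleBody"">First paragraph.</p>" +
    @"<p itemprop=""articleBody"">Second paragraph.</p>" +
    @"<p itemprop=""articleBody"">Third paragraph.</p>";

var doc = new HtmlDocument();
doc.LoadHtml(sample);

// Same XPath as in the question.
var nodes = doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");

// All three paragraphs match, so the query is correct; if the same XPath
// returns null against the downloaded page, the server sent different markup.
Console.WriteLine(nodes == null ? "no matches" : nodes.Count + " matches");
```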

1 Answer


It seems that the New York Times detects that you're not accepting cookies from them. As a result, they present you with a cookie warning and a logon box instead of the article. By providing a CookieContainer you can have .NET handle the whole cookie business under the hood, and the NYT will serve you its contents:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            WebClient wc = new WebClientEx(new CookieContainer());

            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);

            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    // WebClient subclass that routes every request through a shared
    // CookieContainer, so cookies set by one response are resent on the next.
    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

With thanks to this answer for the extended WebClient class.
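As an aside, on .NET 4.5 and later the same cookie handling is available without subclassing WebClient at all: HttpClientHandler owns a CookieContainer directly. This is a sketch against the same URL, not the original answer's code, and it assumes an async context is available.

```csharp
using System;
using System.Net;
using System.Net.Http;
using HtmlAgilityPack;

// HttpClientHandler manages the cookie round-trip itself when
// UseCookies is enabled, so no custom WebClient subclass is needed.
var handler = new HttpClientHandler
{
    CookieContainer = new CookieContainer(),
    UseCookies = true,
    AllowAutoRedirect = true
};

using (var client = new HttpClient(handler))
{
    string contents = await client.GetStringAsync(
        "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");

    var doc = new HtmlDocument();
    doc.LoadHtml(contents);

    var nodes = doc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");
    Console.WriteLine(nodes == null ? "no matches" : nodes.Count + " matches");
}
```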

Note

It might be against the NYT terms of use to blatantly scrape the news stories off their website.

jessehouwing