Parsing with Async, HtmlAgilityPack, and XPath

Question

I have run into a rather strange problem. It's very hard to explain so please bear with me, but basically here is a brief introduction:

I am new to async programming, but I couldn't locate a problem in my code.
I have used HtmlAgilityPack before, but never the .NET 4.5 version.
This is a learning project; I am not trying to scrape or anything like that.

Basically, what is happening is this: I am retrieving a page from the internet, loading it via stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:

            myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));

            using (myStream)
            {
                myDocument.Load(myStream);
            }

The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath are getting their HTML mangled. Here is a sample piece of HTML which I got in a response taken from Fiddler:

                    <div id="menu">
   <div id="splash">
      <div id="menuItem_1" class="ScreenTitle"  >Horse Racing</div>
      <div id="menuItem_2" class="Title"  >Wednesday Racing</div>
      <div id="subMenu_2">
         <div id="menuItem_3" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361707-2-20181217-0-0-1-0-0-4020-0-36200255-1-0-0-0-0">21.51 Britannia Way</a></div>
         <div id="menuItem_4" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>
         <div id="menuItem_5" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361713-2-20181217-0-0-1-0-0-4020-0-36200261-1-0-0-0-0">21.57 Britannia Way</a></div>
         <div id="menuItem_6" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361716-2-20181217-0-0-1-0-0-4020-0-36200264-1-0-0-0-0">22.00 Britannia Way</a></div>
         <div id="menuItem_7" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361719-2-20181217-0-0-1-0-0-4020-0-36200267-1-0-0-0-0">22.03 Britannia Way</a></div>
         <div id="menuItem_8" class="Level2"  >&#187;  <a href="../coupon/?ptid=4020&amp;key=2-70-70-22361722-2-20181217-0-0-1-0-0-4020-0-36200270-1-0-0-0-0">22.06 Britannia Way</a></div>
      </div>
   </div>
</div>

The XPath I am using is 100% correct because it works in the browser on the same page, but here is an example of an a tag which it is retrieving from the page shown above:

<a href="./coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0"">1.54 Britannia Way</</a>

And here is the original which I copied from above for simplicity:

<a href="../coupon/?ptid=4020&amp;key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>

As you can see, the InnerText has changed considerably and so has the URL. Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!

score 2 · Answer 1 · edited May 23 '17 at 10:33

Don't assume that an XPath expression that works in your browser (after DOM conversion, possibly after data has been loaded with AJAX, ...) will also work against the raw page source. This seems to be a site giving betting quotes, so I'd guess they're loading the data with some JavaScript calls.

Verify whether your XPath expression matches the page's source code (as fetched using wget or by clicking "View Source Code" in your browser; don't use Firebug or similar tools for this!).
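
As a rough sketch of that check in C#: the snippet below fetches the raw response, saves it to a file, and counts what the XPath expression matches in exactly that string. `SourceCheck`, `CountMatches`, and the `response.html` file name are made up for illustration, the URL and expression are taken from the command line, and `async Main` assumes C# 7.1 or later.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class SourceCheck
{
    // Counts the nodes an XPath expression selects in a raw HTML string.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static async Task Main(string[] args)
    {
        // args[0] is the page URL, args[1] is the XPath expression to check.
        using (var client = new HttpClient())
        {
            string html = await client.GetStringAsync(args[0]);
            // Keep a copy of exactly what the server sent, for diffing later.
            File.WriteAllText("response.html", html);
            Console.WriteLine(CountMatches(html, args[1]));
        }
    }
}
```

If the count is 0 here but the expression matches in the browser, the browser is working on a different document than your program is.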

If the site is using AJAX to load the data, you might have luck using Firebug to monitor which resources get fetched while the page is loaded. Often these are JSON or XML files that are very easy to parse, and working with them is easier than parsing a horrible mess of HTML.

Update: In this particular case, the site redirects users who don't send an Accept-Language header to a language selection page. Send such a header to receive the same content as the browser does. In curl, it would look like this:

curl -H "Accept-Language: en-US;q=0.6,en;q=0.4" "https://mobile.bet365.com/sport/splash/Default.aspx?Sport"
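
With `HttpClient` in C#, the equivalent could look roughly like this. It is only a sketch: `ClientFactory` and `CreateClient` are hypothetical names, and the header value is simply the one from the curl command.

```csharp
using System.Net.Http;

static class ClientFactory
{
    // Returns an HttpClient that sends an Accept-Language header with every request.
    public static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // TryAddWithoutValidation stores the raw header string without
        // validating it against the typed header parser.
        client.DefaultRequestHeaders.TryAddWithoutValidation(
            "Accept-Language", "en-US;q=0.6,en;q=0.4");
        return client;
    }
}
```

Any request made through the returned client, for example with `GetStreamAsync`, then carries the header.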

answered Jun 11 '14 at 21:18

Jens Erat

Hi, thanks for your comment, but the site is not loading data with AJAX. The HTML I showed above is the raw HTML which I got from the response, and as you can see the HTML retrieved by the XPath expression is mangled (some data is missing, or added). – TheGateKeeper Jun 11 '14 at 21:23
We cannot help you any further if you do not give all relevant information: page URL, XPath expressions. A problem which cannot be reproduced usually will not get fixed. – Jens Erat Jun 11 '14 at 21:24
The URL is `https://mobile.bet365.com/sport/splash/Default.aspx?Sport=2&key=2&L=1`, then click on today's races, and the XPath expression is `//div[@id='subMenu_2']//a`. Even if the expression was incorrect, it should not get anything or get something other than what I want, not get what I want but with different HTML. Could this be a bug in HTMLAGILITYPACK? Can you try to reproduce it? – TheGateKeeper Jun 11 '14 at 21:29
If you fetch the site using wget/curl, you will notice that you get redirected to another page containing a language chooser instead. – Jens Erat Jun 11 '14 at 21:33
Please read what I am saying. I have already dealt with the Language page, I am getting the exact HTML which I showed above. I want to know why and how is it possible that an XPath expression is obtaining HTML that is not present in the page!! – TheGateKeeper Jun 11 '14 at 21:39
You _never_ wrote anything about dealing with a language chooser. XPath will never return anything not on the page. Are you _totally sure_ you are getting the same HTML as in the browser, i.e. by dumping what you received from the HTTP request? Please read about how to post an [SSCCE](http://sscce.org); with the information given, everything further is just guesswork. – Jens Erat Jun 11 '14 at 21:43
I will get back to you tomorrow, but the HTML I posted above was taken from the response which I received from the server. I will check what was actually loaded in the HtmlDocument just to be sure though. – TheGateKeeper Jun 11 '14 at 21:47
Another possibility (so really check carefully, e.g. by `diff`ing the output): some of those sites don't like getting scraped (guess why) and have some countermeasures. Handing wrong or faked data to recognized scrape attempts might be one of those. – Jens Erat Jun 11 '14 at 21:48
I thought about that before, but like I said I posted the exact HTML up top which I got and as you can see it is correct data. It could be something with the HtmlDocument's encoding; I will check tomorrow. Thanks. – TheGateKeeper Jun 11 '14 at 22:08
I figured it out mate, thanks a lot for your help. I posted the answer below! Hope it helps someone! – TheGateKeeper Jun 12 '14 at 22:33

score 0 · Accepted Answer · answered Jun 12 '14 at 22:33

After many hours of guessing and debugging, the problem turned out to be an HtmlDocument that I was re-using. I solved the problem by creating a new HtmlDocument each time I wanted to load a new page, instead of using the same one.
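
A minimal sketch of that fix, assuming the stream-loading code from the question. `PageLoader`, `LoadDocument`, and `FetchAsync` are hypothetical names; the only point is that `new HtmlDocument()` runs once per page rather than once for the whole program.

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

static class PageLoader
{
    // Parses one stream into a brand-new HtmlDocument.
    public static HtmlDocument LoadDocument(Stream stream)
    {
        var document = new HtmlDocument(); // new instance on every call, never shared
        document.Load(stream);
        return document;
    }

    // Fetches a page and returns a document that belongs to this call only.
    public static async Task<HtmlDocument> FetchAsync(HttpClient httpClient, string url)
    {
        using (Stream stream = await httpClient.GetStreamAsync(url))
        {
            return LoadDocument(stream);
        }
    }
}
```

Because each call returns its own document, several awaited loads can no longer write into the same instance.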

I hope this saves you the time that I lost!
