I have run into a rather strange problem. It is hard to explain, so please bear with me. First, some background:
- I am new to async programming, but I couldn't find a problem in my code.
- I have used HtmlAgilityPack before, but never the .NET 4.5 version.
- This is a learning project; I am not trying to scrape or anything like that.
Basically, what is happening is this: I am retrieving a page from the internet, loading it via a stream into an HtmlDocument, then retrieving certain HtmlNodes from it using XPath expressions. Here is a piece of simplified code:
myStream = await httpClient.GetStreamAsync(string.Format("{0}{1}", SomeString, AnotherString));
using (myStream)
{
    myDocument.Load(myStream);
}
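The snippet above leaves out the XPath step, so here is a minimal, self-contained sketch of the whole flow for anyone who wants to reproduce it. The URL and the XPath expression are placeholders I made up to match the sample markup shown further down (they are not the real ones), and the `async Main` signature assumes a C# 7.1 or later compiler.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main()
    {
        // Hypothetical URL; the real one is built from SomeString and AnotherString.
        const string url = "http://example.com/menu";

        // Assumed XPath that matches the links in the sample markup;
        // the actual expression used in the project may differ.
        const string xpath = "//div[@id='subMenu_2']//a";

        var doc = new HtmlDocument();
        using (var httpClient = new HttpClient())
        using (var stream = await httpClient.GetStreamAsync(url))
        {
            // Parse the response directly from the network stream.
            doc.Load(stream);
        }

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes(xpath);
        if (links == null)
        {
            Console.WriteLine("No nodes matched.");
            return;
        }

        foreach (var link in links)
        {
            // Print each piece separately so it can be compared with the Fiddler capture.
            Console.WriteLine(link.GetAttributeValue("href", ""));
            Console.WriteLine(link.InnerText);
            Console.WriteLine(link.OuterHtml);
        }
    }
}
```

Printing `href`, `InnerText`, and `OuterHtml` separately makes it easier to see which part of the node differs from the raw response.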
The HTML is being retrieved correctly, but the HtmlNodes extracted by XPath are getting their HTML mangled. Here is a sample piece of HTML which I got in a response taken from Fiddler:
<div id="menu">
<div id="splash">
<div id="menuItem_1" class="ScreenTitle" >Horse Racing</div>
<div id="menuItem_2" class="Title" >Wednesday Racing</div>
<div id="subMenu_2">
<div id="menuItem_3" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361707-2-20181217-0-0-1-0-0-4020-0-36200255-1-0-0-0-0">21.51 Britannia Way</a></div>
<div id="menuItem_4" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>
<div id="menuItem_5" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361713-2-20181217-0-0-1-0-0-4020-0-36200261-1-0-0-0-0">21.57 Britannia Way</a></div>
<div id="menuItem_6" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361716-2-20181217-0-0-1-0-0-4020-0-36200264-1-0-0-0-0">22.00 Britannia Way</a></div>
<div id="menuItem_7" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361719-2-20181217-0-0-1-0-0-4020-0-36200267-1-0-0-0-0">22.03 Britannia Way</a></div>
<div id="menuItem_8" class="Level2" >» <a href="../coupon/?ptid=4020&key=2-70-70-22361722-2-20181217-0-0-1-0-0-4020-0-36200270-1-0-0-0-0">22.06 Britannia Way</a></div>
</div>
</div>
</div>
The XPath I am using is correct, because it works in the browser on the same page, but here is an example of an a tag which it is retrieving from the previously shown page:
<a href="./coupon/?ptid=4020&key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0"">1.54 Britannia Way</</a>
And here is the original which I copied from above for simplicity:
<a href="../coupon/?ptid=4020&key=2-70-70-22361710-2-20181217-0-0-1-0-0-4020-0-36200258-1-0-0-0-0">21.54 Britannia Way</a></div>
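As you can see, both the InnerText and the URL have changed: the text has lost its first character (21.54 became 1.54), the URL has lost a leading dot (../ became ./), and stray characters have appeared (an extra " after the href and an extra </ before the closing tag). Obviously my program doesn't work, but I don't know why. What can cause this? Is it a bug in HtmlAgilityPack? Please advise! Thanks for reading!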