Html nodes issue with HtmlAgilityPack

Question

I'm having a lot of trouble parsing this HTML content with the HtmlAgilityPack library.

In this piece of markup, I would like to retrieve only the URL (href) that belongs to uploaded.net, but I can't determine which URL refers to it.

<div class='downloads' id='download_block'>

    <h5 style='text-align:center'>FREE DOWNLOAD LINKS</h5>

    <h4>uploadable.ch</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

    <h4>uploaded.net</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

    <h4>novafile.com</h4>
    <ul class='parts'>
        <li>
            <a href="http://url/..." target="_blank"> text here</a>
        </li>
    </ul>

</div>

This is how it looks on the webpage (screenshot not included).

And this is what I have:

nodes = myHrmlDoc.DocumentNode.SelectNodes(".//div[@class='downloads']/ul[@class='parts']")

I can't just use an array-index to determine the position like:

nodes(0) = uploadable.ch node
nodes(1) = uploaded.net node
nodes(2) = novafile.com node

...because the number of nodes and the order of the hosts could change.

Note also that the URLs do not contain the host names; they are redirects like:

http://xxxxxx/r/YEHUgL44xONfQAnCNUVw_aYfY5JYAy0DT-i--

What could I do, in C# or VB.NET?

score 2 · Accepted Answer · answered Apr 07 '15 at 18:20

This should do it (untested, though):

doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'uploaded.net')]/following-sibling::ul//a").Attributes["href"].Value

Also, use `contains` because you never know whether the heading text has extra spaces around it.
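A slightly more defensive variant of the expression above can be sketched as follows. It is only a sketch: the class and method names (`HostLinks`, `GetLinks`) are made up for illustration, and it assumes the host name contains no single quote, because it is spliced into the XPath string literal. The `[1]` after `following-sibling::ul` limits the match to the list directly after the heading, which matters once you collect all links with `SelectNodes` instead of taking only the first one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HostLinks
{
    // Returns every href listed under the <h4> whose text contains hostName.
    // Assumption: hostName has no single quote, since it goes into an XPath literal.
    public static List<string> GetLinks(string html, string hostName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // [1] keeps only the <ul> immediately following the heading,
        // so links that belong to later hosts are not picked up.
        string xpath = "//h4[contains(text(),'" + hostName + "')]"
                     + "/following-sibling::ul[1]//a[@href]";

        // HtmlAgilityPack returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(xpath);
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors.Select(a => a.GetAttributeValue("href", string.Empty)).ToList();
    }
}
```

Calling `HostLinks.GetLinks(html, "uploaded.net")` on the markup from the question should give back only the links from the list under the uploaded.net heading, and an empty list if that heading is missing.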

answered Apr 07 '15 at 18:20

Xi Sigma

Thanks, this is the simplest and most awesome answer. What kind of black magic is that? Are those expressions really part of XPath syntax? Just one question: is `contains` case-insensitive? If so, it's just perfect. – ElektroStudios Apr 07 '15 at 18:43
@ElektroStudios No, it is case-sensitive. Do you want to make it case-insensitive? You can, but it will get ugly. – Xi Sigma Apr 07 '15 at 18:46
I should make it case-insensitive to prevent future headaches, if you could help me. – ElektroStudios Apr 07 '15 at 18:47
@ElektroStudios http://stackoverflow.com/questions/8474031/case-insensitive-xpath-contains-possible – Xi Sigma Apr 07 '15 at 18:48
And can't I just use patterns like `"*ploaded*"`? Is that possible? – ElektroStudios Apr 07 '15 at 18:51
@ElektroStudios In XPath 2.0 you can, but HAP supports XPath 1.0 only, as far as I know. XPath 1.0 still has functions like starts-with, contains, substring, etc. If it gets really complicated, that's when I use LINQ and regex. – Xi Sigma Apr 07 '15 at 18:54
@ElektroStudios but note it can be `contains(text(),'ploaded')` for sure! – Xi Sigma Apr 07 '15 at 19:03
What kind of XPath voodoo is this? I need to learn to use that more. – Matt Apr 27 '15 at 20:23
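
For the case-insensitive match discussed in the comments above, the usual XPath 1.0 workaround is to lower-case the heading text with `translate()` before calling `contains()`. The following is a minimal sketch of that idea, not a tested drop-in: the names `CaseInsensitiveHostLink` and `FindLink` are hypothetical, only ASCII letters are folded, and the host name is again assumed to contain no single quote.

```csharp
using System;
using HtmlAgilityPack;

public static class CaseInsensitiveHostLink
{
    private const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private const string Lower = "abcdefghijklmnopqrstuvwxyz";

    // Returns the first href under the heading that contains hostName,
    // ignoring ASCII case, or null when no such heading or link exists.
    public static string FindLink(HtmlDocument doc, string hostName)
    {
        // XPath 1.0 has no lower-case(); translate() maps each upper-case
        // letter to its lower-case counterpart instead.
        string xpath = "//h4[contains(translate(text(),'" + Upper + "','" + Lower + "'),'"
                     + hostName.ToLowerInvariant() + "')]"
                     + "/following-sibling::ul[1]//a[@href]";

        HtmlNode anchor = doc.DocumentNode.SelectSingleNode(xpath);
        return anchor == null ? null : anchor.GetAttributeValue("href", string.Empty);
    }
}
```

With this, a heading written as `Uploaded.NET` and a search term of `uploaded.net` should match each other, at the cost of the longer expression the comments warn about.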

TyCobb · Answer 2 · 2015-04-07T17:02:22.883

The only way I see this working is a two-step approach. Sorry, I don't have HtmlAgilityPack at hand, but here is an example using the standard XmlDocument. Even though you said you can't use array indexes to access the nodes, this process should allow you to do that by looking up the correct index dynamically.

void Main()
{
    var xml = @"
<div class=""downloads"" id=""download_block"">
    <h5 style=""text-align:center"">FREE DOWNLOAD LINKS</h5>
    <h4>uploadable.ch</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://url/..."" target=""_blank""> text here</a>
        </li>
    </ul>
    <h4>uploaded.net</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://upload.net/..."" target=""_blank""> text here</a>
        </li>
    </ul>
    <h4>novafile.com</h4>
    <ul class=""parts"">
        <li>
            <a href=""http://url/..."" target=""_blank""> text here</a>
        </li>
    </ul>
</div>";

    var xmlDocument = new XmlDocument();
    xmlDocument.LoadXml(xml);

    var nav = xmlDocument.CreateNavigator();
    var index = nav.Evaluate("count(//h4[text()='uploaded.net']/preceding-sibling::h4)+1").ToString();
    var text = xmlDocument.SelectSingleNode("//ul[" + index + "]//a/@href").InnerText;

    Console.WriteLine(text);
}

Basically, it gets the index of the uploaded.net h4 and then uses that index to select the matching ul tag and get the URL out of the underlying anchor tag.

Sorry for the not-so-clean and error-prone code, but it should point you in the right direction.

Thanks so much for the h4's index approach! – ElektroStudios Apr 07 '15 at 18:03

score 1 · Answer 3 · answered Apr 07 '15 at 17:22

Given the snippet you supplied, this will help you get started.

var page = "<div class=\"downloads\" id=\"download_block\">    <h5 style=\"text-align:center\">FREE DOWNLOAD LINKS</h5>    <h4>uploadable.ch</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul>    <h4>uploaded.net</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul>    <h4>novafile.com</h4>    <ul class=\"parts\">        <li>            <a href=\"http://url/...\" target=\"_blank\"> text here</a>        </li>    </ul></div>";

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(page);

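// Find the <h4> heading for the wanted host (the question asks for uploaded.net).
var nodes = doc.DocumentNode.Descendants("h4").Where(n => n.InnerText.Contains("uploaded.net"));
foreach (var node in nodes)
{
    // NextSibling is the whitespace text node after </h4>; the sibling after that is the <ul>.
    var attr = node.NextSibling.NextSibling.Descendants("a").First().Attributes["href"];
    Console.WriteLine(attr.Value); // the original used LINQPad's Dump()
}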
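
The `NextSibling.NextSibling` hop above relies on exactly one whitespace text node sitting between the heading and its list. A hedged sketch of a version that does not depend on that is shown below; it walks the siblings until it reaches the next element. The names `SiblingWalk` and `FindLink` are illustrative only, and the case-insensitive comparison is an added assumption rather than part of the answer above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class SiblingWalk
{
    // Returns the first href in the <ul> that directly follows the matching <h4>,
    // or null when the heading, the list, or the link is missing.
    public static string FindLink(HtmlDocument doc, string hostName)
    {
        HtmlNode heading = doc.DocumentNode.Descendants("h4")
            .FirstOrDefault(h => h.InnerText.IndexOf(hostName, StringComparison.OrdinalIgnoreCase) >= 0);
        if (heading == null)
        {
            return null;
        }

        // Skip text and comment nodes until the next element sibling.
        HtmlNode sibling = heading.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }

        if (sibling == null || sibling.Name != "ul")
        {
            return null;
        }

        HtmlNode anchor = sibling.Descendants("a").FirstOrDefault();
        return anchor == null ? null : anchor.GetAttributeValue("href", string.Empty);
    }
}
```

Because the comparison happens in C# rather than in XPath, this variant also sidesteps the `translate()` workaround needed for case-insensitive matching in XPath 1.0.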
