
I'm using a function to get all the available XPath expressions from an HTML document, using the HtmlAgilityPack library.

The problem is that I get expressions in this format:

/html[1]/body[1]/div[1]/div[1]/div[1]/div[1]/h4[1]/a[1]

I would like to improve it so that it also gets the names of the nodes/elements, like this:

/html/body/div[@class='infolinks']/div[@class='music']/div[@class='item']/div[@class='release']/h4[1]/a[@title]

But I don't know how to properly get their names with HtmlAgilityPack.

How could I do it?

Note: I'm not an XPath expert, so sorry if the syntax of the XPaths is bad or if I misunderstand things.


The webpage source code that I'm testing with:

<div class="infolinks"><input type="hidden" name="IL_IN_TAG" value="1"/></div><div id="main">

    <div class="music">

        <h2 class="boxtitle">New releases \ <small>
            <a href="/newalbums" title="New releases mp3 downloads" rel="bookmark">see all</a></small>
        </h2>

        <div class="item">

            <div class="thumb">
                <a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" rel="bookmark" lang="en" title="Curt Smith - Deceptively Heavy album downloads"><img width="100" height="100" alt="Mp3 downloads Curt Smith - Deceptively Heavy" title="Free mp3 downloads Curt Smith - Deceptively Heavy" src="http://www.mp3crank.com/cover-album/Curt-Smith-Deceptively-Heavy-400x400.jpg"/></a>
            </div>

            <div class="release">
                <h3>Curt Smith</h3>
                <h4>
                    <a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" title="Mp3 downloads Curt Smith - Deceptively Heavy">Deceptively Heavy</a>
                </h4>
                <script src="/ads/button.js"></script>
            </div>

            <div class="release-year">
                <p>Year</p>
                <span>2013</span>
            </div>

            <div class="genre">
                <p>Genre</p>
                <a href="http://www.mp3crank.com/genre/indie" rel="tag">Indie</a><a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
            </div>

        </div>

        <div class="item">

            <div class="thumb">
                <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads"><img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
            </div>

            <div class="release">
                <h3>Wolf Eyes</h3>
                <h4>
                    <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Mp3 downloads Wolf Eyes - Lower Demos">Lower Demos</a>
                </h4>
                <script src="/ads/button.js"></script>
            </div>

            <div class="release-year">
                <p>Year</p>
                <span>2013</span>
            </div>

            <div class="genre">
                <p>Genre</p>
                <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
            </div>

        </div>

    </div>

</div>


The function to get XPaths:

Public Function GetXPaths(ByVal Document As HtmlAgilityPack.HtmlDocument) As List(Of String)

    Dim XPathList As New List(Of String)
    Dim XPath As String = String.Empty

    For Each Child As HtmlAgilityPack.HtmlNode In Document.DocumentNode.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If

    Next ' child'

    Return XPathList

End Function

Private Sub GetXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                      ByRef XPathList As List(Of String),
                      Optional ByVal XPath As String = Nothing)

    XPath = Node.XPath

    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If

    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)

        End If

    Next ' child

End Sub

And these are the XPaths that I use to retrieve some values. I would like the function above to produce more or less the same fully qualified XPath representation.

Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").GetAttributeValue("title", "Unknown Title")
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").GetAttributeValue("src", String.Empty)
Year = node.SelectSingleNode(".//div[@class='release-year']/span").InnerText
Genres = (From genre In node.SelectNodes(".//div[@class='genre']/a") Select genre.InnerText).ToArray
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").GetAttributeValue("href", "Unknown URL")

ElektroStudios
  • html, body, div, etc. are element names. Did you mean to get the class name of the element (if any)? – har07 Aug 19 '14 at 06:05

1 Answer

This will append a class attribute filter to the XPath if the corresponding element has a class attribute:

Private Sub GetHtmlXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                          ByRef XPathList As List(Of String),
                          Optional ByVal XPath As String = Nothing)

    XPath &= Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))

    Const ClassNameFilter As String = "[@class='{0}']"
    Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)

    If Not String.IsNullOrEmpty(ClassName) Then
        XPath &= String.Format(ClassNameFilter, ClassName)
    End If

    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If

    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetHtmlXPaths(Child, XPathList, XPath)
        End If

    Next Child

End Sub
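
A minimal sketch of how the routine above might be driven, assuming the HtmlAgilityPack package is referenced. The `XPathDemo` module, the `CollectXPaths` wrapper, and the sample markup are hypothetical names introduced only for illustration; the recursive routine is a compact copy of the one above so that the example compiles on its own. Each generated expression is fed back into `SelectSingleNode` as a quick sanity check that it resolves to a node.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module XPathDemo

    ' Hypothetical wrapper: starts the recursion at every top-level element.
    Public Function CollectXPaths(ByVal Document As HtmlDocument) As List(Of String)
        Dim XPathList As New List(Of String)
        For Each Child As HtmlNode In Document.DocumentNode.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                GetHtmlXPaths(Child, XPathList, String.Empty)
            End If
        Next Child
        Return XPathList
    End Function

    ' Compact copy of the routine above: appends this node's step
    ' (name plus positional index) and a class filter when one exists.
    Private Sub GetHtmlXPaths(ByVal Node As HtmlNode,
                              ByVal XPathList As List(Of String),
                              ByVal XPath As String)
        XPath &= Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))
        Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)
        If Not String.IsNullOrEmpty(ClassName) Then
            XPath &= String.Format("[@class='{0}']", ClassName)
        End If
        If Not XPathList.Contains(XPath) Then XPathList.Add(XPath)
        For Each Child As HtmlNode In Node.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                GetHtmlXPaths(Child, XPathList, XPath)
            End If
        Next Child
    End Sub

    Sub Main()
        ' Hypothetical sample markup, loosely modelled on the question's page.
        Dim Document As New HtmlDocument()
        Document.LoadHtml("<div class=""music""><h2>New</h2>" &
                          "<div class=""item""><span>A</span></div>" &
                          "<div class=""item""><span>B</span></div></div>")

        For Each GeneratedPath As String In CollectXPaths(Document)
            ' Check that the generated expression selects something.
            Dim Found As Boolean =
                Document.DocumentNode.SelectSingleNode(GeneratedPath) IsNot Nothing
            Console.WriteLine("{0}  (resolves: {1})", GeneratedPath, Found)
        Next GeneratedPath
    End Sub

End Module
```

Note that a class value containing an apostrophe would break the quoted `[@class='...']` filter; this sketch does not handle that case.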
ElektroStudios
har07
  • The function is not giving the desired results: it misses XPaths when there is more than one div with the same class name (for example, the class name "item" in the source that I've provided above), because the other divs are treated as already added (a false positive) and are skipped. I've taken the liberty of modifying your solution to fix the issue; sorry if this annoys you, and feel free to revert my modification (or improve it). Sorry for my English, and thanks for your answer. – ElektroStudios Aug 19 '14 at 08:15
  • Now, instead of adding something like `div[@class='item']` to the XPath, I'm adding `div[1][@class='item']` and `div[2][@class='item']`. These XPaths work as expected with HtmlAgilityPack. – ElektroStudios Aug 19 '14 at 08:18
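
A side note on the `div[2][@class='item']` form: in XPath 1.0, predicates are applied left to right, so `div[2][@class='item']` means "the second div child, provided its class is item", whereas `div[@class='item'][2]` means "the second of the div children whose class is item". The two differ whenever a div with another class comes first. A small sketch to compare them, assuming HtmlAgilityPack is referenced; the module name and the markup are hypothetical:

```vbnet
Imports System
Imports HtmlAgilityPack

Module PredicateOrderDemo

    Sub Main()
        ' Hypothetical markup: a "thumb" div precedes the two "item" divs.
        Dim Document As New HtmlDocument()
        Document.LoadHtml("<div id=""main"">" &
                          "<div class=""thumb"">T</div>" &
                          "<div class=""item"">A</div>" &
                          "<div class=""item"">B</div></div>")

        ' 1) second div child, kept only if its class is "item"
        ' 2) second among the div children whose class is "item"
        ' 3) first div child, kept only if its class is "item"
        Dim Queries As String() = {
            "/div/div[2][@class='item']",
            "/div/div[@class='item'][2]",
            "/div/div[1][@class='item']"
        }

        For Each Query As String In Queries
            Dim Hit As HtmlNode = Document.DocumentNode.SelectSingleNode(Query)
            ' SelectSingleNode returns Nothing when nothing matches.
            Dim Shown As String = If(Hit Is Nothing, "(no match)", Hit.InnerText)
            Console.WriteLine("{0} -> {1}", Query, Shown)
        Next Query
    End Sub

End Module
```

Since `Node.XPath` counts position among same-named siblings, the index-first form is the one that matches what the function generates.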