Can't parse a specific value of an item from a webpage using vba

Question

I've created a script in VBA to get a specific item from a webpage. The value of the item (Year Built) I'm interested in is not always in the same index, so using index is a wrong idea here. I'm giving two links below only because the value of the items are in different indices in two webpages.

site one

site two

My initial approach to get the value was:

.NextSibling.getElementsByTagName("td")(3).innerText

The value that I'm after is visible as:

What I'm trying with now (works but the position is still hypothetical and will break if the position changes):

.NextSibling.LastChild.PreviousSibling.innerText

I've created so far:

Sub GetInformation()
    Dim Http As New XMLHTTP60, links, i&
    Dim Htmldoc As New HTMLDocument, link
    Dim Wb As Workbook, ws As Worksheet, r&

    Set Wb = ThisWorkbook
    Set ws = Wb.Worksheets("Sheet1")

    links = Array( _
        "https://esearch.brazoscad.org/Property/View/114414", _
        "https://esearch.brazoscad.org/Property/View/117608" _
       )

    For Each link In links
        With Http
            .Open "GET", link, False
            .send
            Htmldoc.body.innerHTML = .responseText
        End With


        With Htmldoc.querySelectorAll("tr")
            For i = 0 To .Length - 1
                If InStr(.item(i).innerText, "Year Built") > 0 Then
                    r = r + 1: ws.Cells(r, 1) = .item(i).NextSibling.LastChild.PreviousSibling.innerText
                End If
            Next i
        End With
    Next link
End Sub

How can I get a specific value of an item from a web-page?

By the way, if .querySelector() supports :nth-of-type(), what's wrong with .querySelector("table:nth-of-type(2) tr") when I use it within the script which doesn't work.

QHarr · Accepted Answer · 2019-11-09T20:56:43.050

if .querySelector() supports :nth-of-type(), what's wrong with .querySelector("table:nth-of-type(2) tr") when I use it within the script which doesn't work

It is supported when using Microsoft Internet Controls to automate browser (IE8+) and created HTMLDocument off ie.Document. You then have access to a very small number of pseudo class selectors. This is not the case with HTMLDocument when innerHTML is provided via MSXML2.XMLHTTP. Remember that the content you feed into your HTMLDocument variable .innerHTML will be different in XHR where javascript won't run versus IE where the js will run and the browser will modify content/request additional files leaving you with a modified .document. As mentioned at start, there is of course also a browser/document mode dependency for the latter.

The selector table:nth-of-type(2) tr, even if supported, would not be appropriate here.

The value of the item (Year Built) I'm interested in is not always in the same index, so using index is a wrong idea here

Based on closer inspection of your code it seems that the variability you are trying to account for is the potential difference in number of columns in the target table and thus the potential for your element to reside in a td at a different index within a given row (you are not attempting to account for row variability for example...). So we are looking for a different relationship overall; no relationship between elements being required; or to dynamically determine the appropriate index; or maybe even a combination of these.

IMO considerations being:

Same URI but an alternate element on the page with a shorter, hopefully more robust, selector;
Different XHR URI where the desired element is associated with a more robust selector e.g. an unique id;
A script tag with a nice regex grabbable string (var yearBuilt = 1234;);
A positional strategy that has fewer dependencies and/or, based on experience, a higher probability of stability over time

Additionally,

Optimized for faster retrieval

I recognise that the above are re-hashings of the same overall idea.

Looking at the considerations and the two links provided:

The year built, associated with MAIN AREA, is only present in one place in the document. Note: I retain the assumption that this is the next row down from the appropriate header row. I haven't examined enough links to know whether this year value can vary by area of property, and you have not stated which is the required. MAIN AREA appears, in this sample, to be the first part listed with build date.

The page does not appear to retrieve the required content from additional requests so an alternate source is not immediately obvious. There doesn't appear to be a dedicated public API. The search functionality doesn't provide the neccessary info from its POST requests, and the downloadable files have 3-4 month lag, are .txt predominantly and do not offer any realistic opportunities for faster identification of the required info (in fact would be a lot more work and less reliable).

This leaves consideration 4. You need a way to locate the right column in the right table. The html has a very repetitive structure with few nice 'hooks'. Rather than generating a more fragile path dependant on relationships of tables you have sensibly opted for a loop over trs (ergo should be in a table) looking for a key header string in tr innerText. So, traded off the risk of header string occuring in a different column and/or different table for the shorter traversal path and flexibility to move to the next row which is assumed to contain the data of interest.

So far, I think good choices, though I personally would opt for limiting the search to headers (th), and then step up to parent. The additional benefit here is that I could then mitigate for your next part:

.Item(i).NextSibling.LastChild.PreviousSibling.innerText

Here you have built in an unneccessary assumption/risk that your column of interest will always be the penultimate. Though you could loop all headers and go up to parent node I would take the risk of restricting to appropriate table by searching for a unique string in the panel-heading and then grab the next-sibling table before examining headers. It introduces IMO a reasonable assumption regarding relationship of panel heading to table and of panel content. This then allows us to find the right index for the header based off the table and use that index to index into the tds of the next row. This mitigates for position not being the penultimate. You could then look for some further optimizations. I went with setting matches into variables for faster referencing.

A few more lines of code but no greater complexity, more security over matching on the right element, suitable exit strategies and less looping (due to targeting of table) despite there being two loop structures.

Overall your strategy is a good one. I personally would trade the risk of trying to get right table over assuming the right column is the penultimate one. I went with a slightly different relationship and dynamically determining the right index. I'm not entirely satisified with the solution but it feels good enough.

VBA:

Option Explicit

Public Sub GetInformation()
    Dim Http As New XMLHTTP60, links, i&
    Dim htmlDoc As New HTMLDocument, link
    Dim Wb As Workbook, ws As Worksheet, r&

    Set Wb = ThisWorkbook
    Set ws = Wb.Worksheets("Sheet1")

    links = Array( _
            "https://esearch.brazoscad.org/Property/View/114414", _
            "https://esearch.brazoscad.org/Property/View/117608" _
            )

    For Each link In links
        With Http
            .Open "GET", link, False
            .send
            htmlDoc.body.innerHTML = .responseText
        End With

        Dim panels As Object, table As Object, headers As Object

        Set panels = htmlDoc.querySelectorAll(".panel-heading")

        For i = 0 To panels.Length - 1
            If InStr(panels.Item(i).innerText, "Property Improvement - Building") > 0 Then
                Set table = panels.Item(i).NextSibling 'assumption on relationship
                Exit For
            End If
        Next i

        If Not table Is Nothing Then

            Set headers = table.getElementsByTagName("th")

            For i = 0 To headers.Length - 1
                If InStr(headers(i).innerText, "Year Built") > 0 Then
                    r = r + 1: ws.Cells(r, 1) = headers(i).ParentNode.NextSibling.Children(i).innerText
                    Exit For
                End If
            Next
        End If
        Set htmlDoc = Nothing: Set table = Nothing
    Next link
End Sub

References (VBE>Tools>References):

Microsoft HTML Object Library
Microsoft XML v(n) 'your version

Can't parse a specific value of an item from a webpage using vba

1 Answers1

Linked