if .querySelector() supports :nth-of-type(), what's wrong with
.querySelector("table:nth-of-type(2) tr") when I use it within the
script which doesn't work
It is supported when using Microsoft Internet Controls
to automate browser (IE8+) and created HTMLDocument
off ie.Document
. You then have access to a very small number of pseudo class selectors. This is not the case with HTMLDocument
when innerHTML
is provided via MSXML2.XMLHTTP
. Remember that the content you feed into your HTMLDocument
variable .innerHTML
will be different in XHR where javascript won't run versus IE
where the js will run and the browser will modify content/request additional files leaving you with a modified .document
. As mentioned at start, there is of course also a browser/document mode dependency for the latter.
The selector table:nth-of-type(2) tr
, even if supported, would not be appropriate here.
The value of the item (Year Built) I'm interested in is not always in
the same index, so using index is a wrong idea here
Based on closer inspection of your code it seems that the variability you are trying to account for is the potential difference in number of columns in the target table and thus the potential for your element to reside in a td
at a different index within a given row (you are not attempting to account for row variability for example...). So we are looking for a different relationship overall; no relationship between elements being required; or to dynamically determine the appropriate index; or maybe even a combination of these.
IMO considerations being:
- Same URI but an alternate element on the page with a shorter, hopefully more robust, selector;
- Different XHR URI where the desired element is associated with a more robust selector e.g. an unique id;
- A
script
tag with a nice regex grabbable string (var yearBuilt = 1234;);
- A positional strategy that has fewer dependencies and/or, based on experience, a higher probability of stability over time
Additionally,
- Optimized for faster retrieval
I recognise that the above are re-hashings of the same overall idea.
Looking at the considerations and the two links provided:
The year built, associated with MAIN AREA
, is only present in one place in the document. Note: I retain the assumption that this is the next row down from the appropriate header row. I haven't examined enough links to know whether this year value can vary by area of property, and you have not stated which is the required. MAIN AREA
appears, in this sample, to be the first part listed with build date.
The page does not appear to retrieve the required content from additional requests so an alternate source is not immediately obvious. There doesn't appear to be a dedicated public API. The search functionality doesn't provide the neccessary info from its POST requests, and the downloadable files have 3-4 month lag, are .txt predominantly and do not offer any realistic opportunities for faster identification of the required info (in fact would be a lot more work and less reliable).
This leaves consideration 4. You need a way to locate the right column in the right table. The html has a very repetitive structure with few nice 'hooks'. Rather than generating a more fragile path dependant on relationships of tables you have sensibly opted for a loop over tr
s (ergo should be in a table) looking for a key header string in tr
innerText
. So, traded off the risk of header string occuring in a different column and/or different table for the shorter traversal path and flexibility to move to the next row which is assumed to contain the data of interest.
So far, I think good choices, though I personally would opt for limiting the search to headers (th
), and then step up to parent. The additional benefit here is that I could then mitigate for your next part:
.Item(i).NextSibling.LastChild.PreviousSibling.innerText
Here you have built in an unneccessary assumption/risk that your column of interest will always be the penultimate. Though you could loop all headers and go up to parent node I would take the risk of restricting to appropriate table by searching for a unique string in the panel-heading and then grab the next-sibling
table before examining headers. It introduces IMO a reasonable assumption regarding relationship of panel heading
to table
and of panel
content. This then allows us to find the right index for the header based off the table
and use that index to index into the tds
of the next row. This mitigates for position not being the penultimate. You could then look for some further optimizations. I went with setting matches into variables for faster referencing.
A few more lines of code but no greater complexity, more security over matching on the right element, suitable exit strategies and less looping (due to targeting of table) despite there being two loop structures.
Overall your strategy is a good one. I personally would trade the risk of trying to get right table over assuming the right column is the penultimate one. I went with a slightly different relationship and dynamically determining the right index.
I'm not entirely satisified with the solution but it feels good enough.
VBA:
Option Explicit
Public Sub GetInformation()
Dim Http As New XMLHTTP60, links, i&
Dim htmlDoc As New HTMLDocument, link
Dim Wb As Workbook, ws As Worksheet, r&
Set Wb = ThisWorkbook
Set ws = Wb.Worksheets("Sheet1")
links = Array( _
"https://esearch.brazoscad.org/Property/View/114414", _
"https://esearch.brazoscad.org/Property/View/117608" _
)
For Each link In links
With Http
.Open "GET", link, False
.send
htmlDoc.body.innerHTML = .responseText
End With
Dim panels As Object, table As Object, headers As Object
Set panels = htmlDoc.querySelectorAll(".panel-heading")
For i = 0 To panels.Length - 1
If InStr(panels.Item(i).innerText, "Property Improvement - Building") > 0 Then
Set table = panels.Item(i).NextSibling 'assumption on relationship
Exit For
End If
Next i
If Not table Is Nothing Then
Set headers = table.getElementsByTagName("th")
For i = 0 To headers.Length - 1
If InStr(headers(i).innerText, "Year Built") > 0 Then
r = r + 1: ws.Cells(r, 1) = headers(i).ParentNode.NextSibling.Children(i).innerText
Exit For
End If
Next
End If
Set htmlDoc = Nothing: Set table = Nothing
Next link
End Sub
References (VBE>Tools>References):
- Microsoft HTML Object Library
- Microsoft XML v(n) 'your version