4

Intro:

Some of you may have noticed that something has broken in relation to the querySelectorAll method of MSHTML.HTMLDocument from MSHTML.Dll (via a Microsoft HTML Document Library reference). This, I believe, has happened in the last month. It may not affect all users and I will update this Q&A as I get more info on which versions etc are affected. Please feel free to comment below with your set-up and whether working or not for both late-bound and early-bound (as per code in answer)


Accessing DispStaticNodeList methods:

Traditionally, at least in my experience, it has been the norm to hold a reference to the DispStaticNodeList, which is what querySelectorAll returns, in a generic late-bound Object type:

E.g.

Dim nodeList1 As Object

Set nodeList1 = html.querySelectorAll("a")

where html is an instance of MSHTML.HTMLDocument.

As you can see from the Locals window, you get the expected nodeList shown:

enter image description here

You could then access the list of the document's elements, that match the specified group of selectors, with .item(index), and get the number of items matched with .Length. E.g.

Debug.Print nodeList1.item(0).innerText
Debug.Print nodeList1.Length

What happens now?

Attempts to access the methods, via late bound Object, and its underlying interfaces, lead to either an Object required, when using the .item() method call, or Null when querying the .Length(). E.g.

nodeList1.item(0).innertext  ' => Run-time error '424': Object required
Debug.Print nodeList1.Length ' => Null 

This happens when you hold a reference through assigning to a variable.


What you can do:

You can use With and work off html, avoiding the Object class

With html.querySelectorAll("a")
    For i = 0 To .Length - 1
       Debug.Print .Item(i).innerText
    Next
End With

So, I think the problem is very much about the Object data type and its underlying interfaces. And possibly, something about this has broken in relation to MSHTML, and most likely, the now no longer supported, Internet Explorer, which sits in the background:

However, this is not desirable, as you parse, and re-parse, the same HTML, during the loop, losing much of the efficiency that its gained by choosing css selectors over traditional methods e.g. getElementsByClassName. Those traditional methods remain intact.


Why do some of us care?

Modern browsers (and even IE8 onwards) support faster node matching through use of css selectors. It seems reasonable to assume that this carried over into the DOM parsers with MSHTML.HTMLDocument. So, you have faster matching, combined with more expressive and concise syntax (none of those long chained method calls e.g. getElementsByClassName("abc")(0).getElementsByTagName("def")(0).....), the ability to return more desired nodes, without repeated calls (in the prior example you will only get def as children of the first element with class abc, rather than all children, with tag def, of all elements with class abc, which you would get with querySelectorAll(".abc def"). And, you lose the flexibility to specify much more complex and specific patterns for node matching e.g. querySelectorAll(".abc > def + #ghi). For those interested, you can read more about those selectors on MSDN.


Question:

So, how does one avoid re-parsing, and hold the reference to the returned list of matched nodes? I have found nothing on the internet, despite quite a bit of searching, that documents this recent change in behaviour. It is also a very recent change and that likely only affects a small user base.

I hope the above satisfies the need to demonstrate research into the problem.


My set-up:

OS Name Microsoft Windows 10 Pro
Version 10.0.19042 Build 19042
System Type x64-based PC
Microsoft® Excel® 2019 MSO (16.0.13929.20206) 32-bit (Microsoft Office Professional Plus)
Version 2104 Build 13929.20373
mshtml.dll info as per image

Not affected (TBD):

  1. Office Professional plus 2013. Win 7, 32 bit, MSHTML.dll 11.0.9600.19597
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • 1
    Hi @QHarr, I suppose you wanna give [this post](https://stackoverflow.com/questions/67613429/got-results-in-one-pc-but-encountered-run-time-error-438-in-another-pc-even-wh) a go as the issue in there is somewhat related to what you have discussed here. – SMTH May 20 '21 at 05:02

1 Answers1

3

Do not despair VBA web-scrapers (I know there are a few!) We can still have the luxury of css selectors and the benefits, though admittedly somewhat limited in VBA, that they bring.

To the rescue:

MSHTML, gratias IE, offers a number of scripting object interfaces . One of which is the IHTMLDOMChildrenCollection interface, which inherits from IDispatch, and which:

provides methods to access items in the collection.

This includes the .Length property and access to items via .item(index).

Dim nodeList2 As MSHTML.IHTMLDOMChildrenCollection

Set nodeList2 = html.querySelectorAll("a")
Debug.Print nodeList2.Length                 ' => n 
Debug.Print nodeList2.Item(0).innerText

This is supported on clients Windows XP +, and servers from Windows 2000 Server onwards.


VBA:

Public Sub ReviewingNodeListMethods()
    '' References (VBE > Tools > References):
          ''Microsoft HTML object Library
          ''Microsoft XML library (v.6 for me)

    Dim http As MSXML2.XMLHTTP60, html As MSHTML.HTMLDocument   'XMLHTTP60 is for Excel 2016. Change according to your version e.g. XMLHTTP for 2013)
    
    Set http = New MSXML2.XMLHTTP60: Set html = New MSHTML.HTMLDocument
    
    With http
        .Open "GET", "http://books.toscrape.com/", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim nodeList1 As Object, nodeList2 As MSHTML.IHTMLDOMChildrenCollection
    
    Set nodeList1 = html.querySelectorAll("a")
    Set nodeList2 = html.querySelectorAll("a")
  
    Debug.Print nodeList1.Length                 ' => Null
    Debug.Print nodeList2.Length                 ' => 94
    
    Debug.Print nodeList2.Item(0).innerText
    
    '    Dim i As Long
    '
    '    With html.querySelectorAll("a")
    '        For i = 0 To .Length - 1
    '           Debug.Print .Item(i).innerText
    '        Next
    '    End With
    
    '' ================Warning: This will crash Excel -============================

    '    Dim node As MSHTML.IHTMLDOMNode
    '
    '    For Each node In nodeList2
    '        Debug.Print node.innerText
    '    Next
    '' ================Warning: This will crash Excel -============================

End Sub

N.B. There is still the underlying problem of the collection enumeration method; it causes Excel to crash if you attempt a For Each e.g.

Dim node As MSHTML.IHTMLDOMNode

For Each node In nodeList2
    Debug.Print node.innerText
Next

Updating your old Questions/Answers:

  1. You can use this SEDE query to identify potential candidates for revision. Enter your userid and the search term "querySelectorAll"
  2. Or simply use the following in the search bar: querySelectorAll user:<userid> is:answer ; querySelectorAll user:<userid> is:question
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • 1
    I get 94 from either of the node list. – SIM May 19 '21 at 10:38
  • Interesting. Can you provide me with some version/system info please? It would be helpful to document. – QHarr May 19 '21 at 10:40
  • 1
    Excel version: Office Professional plus 2013. Win 7, 32 bit. – SIM May 19 '21 at 10:43
  • thank you for that. What is your mshtml.dll install please? It will be in a folder under C:\Windows\ e.g. for me it is within `C:\Windows\SysWOW64` – QHarr May 19 '21 at 10:46
  • 1
    I didn't get any notification of your last text, so I would never know that you asked something unless I visit this page. However, [is this](https://imgur.com/w7c3XGC) what you wanted to know @QHarr? – SIM May 19 '21 at 11:28
  • @SIM You around please to help me with some testing? – QHarr May 29 '21 at 12:30
  • I was not around. I'll be available within few minutes @QHarr. – SIM May 29 '21 at 13:48
  • @SIM Hello. Join me in the [dawghaus](https://chat.stackoverflow.com/rooms/169987/dawgs-waffle-haus-) and ping me when free please. – QHarr May 29 '21 at 13:56
  • Why don't I see any option in the chat room to send you any text @QHarr? – SIM May 29 '21 at 19:13
  • I don't know. One moment. You are not showing as in the room at all. @SIM – QHarr May 29 '21 at 19:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233062/discussion-between-qharr-and-sim). – QHarr May 29 '21 at 19:26
  • 1
    I think we can create a script which will run in all version of excel if we avoid using nextsibling. Please give [this](https://pastebin.com/1Xw9FKwU) a shot if you have time. If you get success, I'm sure that will be a workaround @QHarr. – SIM Jun 16 '21 at 19:52
  • Hello there @SIM \o – QHarr Jun 16 '21 at 19:52
  • Hi @QHarr, hope you are doing fine. – SIM Jun 16 '21 at 19:54
  • I like this `[A1:C1] = [{"Product Id"," Land Size","Total Main Area"}] ` – QHarr Jun 16 '21 at 19:55
  • Your linked code fails here: `Cells(R, 2) = " " & .Item(I).ParentNode.ParentNode.LastChild.getElementsByTagName("td")(2).innerText` – QHarr Jun 16 '21 at 19:57
  • Mathieu Guindon will berate you if he comes across your comment. He doesn't like that style at all. – SIM Jun 16 '21 at 19:57
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/233856/discussion-between-qharr-and-sim). – QHarr Jun 16 '21 at 19:58