4

I scrape some websites with vba for fun and I use VBA as tool. I use XMLHTTP and HTMLDocument (cause it's more faster than internetExplorer.Application).

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

        Set XMLHTTPReq = New MSXML2.XMLHTTP

        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0

        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ''''the next line is important to solve this issue *1
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            ''''the next line occur 438Error''''
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            Cells(i + 1, 4) = varTemp2.innerText

            i = i + 1

        Next vr
End Sub

I figure out this problem by *1 cells(1,1) shows me the next things

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............

Yeah all the class tag lost " ". only the first function's class has " " I really don't know why this situation occur.

//Well I could pharse by getElementsByTagName("span"). but I prefer "class" Tag.....

Soborubang
  • 43
  • 1
  • 6
  • http://stackoverflow.com/questions/7927905/internet-explorer-innerhtml-outputs-attributes-without-quotes I don't think HTML requires quotes around attribute values when the values contain no whitespace, and what you're seeing when you look at outerHTML reflects IE's representation of this. That's probably not the source of the error you're getting though. – Tim Williams Dec 16 '15 at 01:39
  • What happens if you try `Set varTemp2 = vr.querySelectorAll("span.post_date")` ? – barrowc Dec 16 '15 at 01:54
  • Thanks to all!! @TimWilliams I see. then getElementsByTagName("span") is the only way I could pharse innerText??? – Soborubang Dec 16 '15 at 04:02
  • @barrowc Sorry your code occur same Error message :( but, thanks for your help! – Soborubang Dec 16 '15 at 04:02
  • 1
    I'm running through some of your code now with an eye to alternate solutions but at a minimum `Cells(i + 1, 4) = varTemp2.innerText` should be `Cells(i + 1, 4) = varTemp2(0).innerText`. `getElementsByClassName` returns a collection, not a single object even if that collection is a collection of one element. –  Dec 16 '15 at 04:11
  • Thanks @Jeeped! That's a quiet good advice. but my problem is how to get the text from class tag. the only way is to use getElementsByTagName but it's no good enough for me. could you give me another advice? – Soborubang Dec 16 '15 at 04:20
  • @barrowc - **querySelector** and **querySelectorAll** are not yet available in a VBA DOM model. –  Dec 16 '15 at 05:34
  • 1
    @Jeeped - `querySelectorAll` works for me on an instance of `HTMLDocument` – Tim Williams Dec 16 '15 at 07:54
  • @TimWilliams - Thanks for that. I've still got that sample workbook lying around unsaved so I'll go back and test. –  Dec 16 '15 at 08:01

2 Answers2

5

The getElementsByClassName method is not considered a method of itself; only of the parent HTMLDocument. If you want to use it to locate elements within a DIV element, you need to create a sub-HTMLDocument comprised of the .outerHtml of that specific DIV element.

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long, iEL As Long
    Dim postURL As String, nr As Long, i As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    'Set htmlDOC = New HTMLDocument
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    i = 0

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub

While other .getElementsByXXXX can readily retrieve collections within another element, the getElementsByClassName method needs to consider what it believes to be the HTMLDocument as a whole, even if you have fooled it into thinking that.

  • really thank you! I didn't know that getElementsByClassName is special one. Serously I admire you! – Soborubang Dec 16 '15 at 05:53
  • MDN has "You may also call `getElementsByClassName()` on any element; it will return only elements which are descendants of the specified root element with the given class names." and I'm pretty sure I've used it that way in the past in IE... – Tim Williams Dec 16 '15 at 07:16
  • Having said that, it doesn't work for me via VBA except when called on the `document` object. ;-[ – Tim Williams Dec 16 '15 at 07:58
  • @TimWilliams - The MDSN docs (linked above) have a few errors. They say it has a return value of type **Element** when it should be something like an HtmlElementCollection object. fwiw, I have seen several examples of VBA HTMLDocument method behavior that don't quite match what is expected from what the standards state (and javascript does properly). –  Dec 16 '15 at 08:11
1

Here's an alternative approach. It's very similar to the original code but uses querySelectorAll to select the relevant span elements. One important point for this method is that vr has to be declared as being a specific element type and not as an IHTMLElement or generic Object:

Option Explicit

Public Sub XMLhtmlDocumentHTMLSourceScraper()

' Changed from generic Object to specific type - not
' strictly necessary to do this
Dim XMLHTTPReq As MSXML2.XMLHTTP60
Dim htmlDoc As HTMLDocument

' These declarations weren't included in the original code
Dim i As Integer
Dim varTemp As Object
' IMPORTANT: vr must be declared as a specific element type and not
' as an IHTMLElement or generic Object
Dim vr As HTMLDivElement
Dim varTemp2 As Object

Dim postURL As String

postURL = "http://foodffs.tumblr.com/archive/2015/11"

' Changed from XMLHTTP to XMLHTTP60 as XMLHTTP is equivalent
' to the older XMLHTTP30
Set XMLHTTPReq = New MSXML2.XMLHTTP60

With XMLHTTPReq
    .Open "GET", postURL, False
    .Send
End With

Set htmlDoc = New HTMLDocument
With htmlDoc
    .body.innerHTML = XMLHTTPReq.responseText
End With

i = 0

Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

For Each vr In varTemp
   ''''the next line is important to solve this issue *1
   Cells(1, 1) = vr.outerHTML

   Set varTemp2 = vr.querySelectorAll("span.post_date")
   Cells(i + 1, 3) = varTemp2.Item(0).innerText

   Set varTemp2 = vr.getElementsByClassName("hover_inner")
   ' incorporating correction from Jeeped's comment (#56349646)
   Cells(i + 1, 4) = varTemp2.Item(0).innerText

   i = i + 1
Next vr

End Sub

Notes:

  • XMLHTTP equivalent to XMLHTTP30 as described here
  • apparent need to declare a specific element type explored in this question but, unlike getElementsByClassName, querySelectorAll doesn't exist in any version of IHTMLElement
Community
  • 1
  • 1
barrowc
  • 10,444
  • 1
  • 40
  • 53