2

My goal is to strip all Google search results from page 1 with VBA to Excel. Until so far I managed to strip the first result. The head, link and date are stored in cells 4, 5, 6. I now have to make a loop for the other li's, but I can't get it straight. Also the function that stores the date isn't very optimal coded I think. Anyone who knows the answer?

Sub XMLHTTP()

Dim url As String, lastRow As Long
Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object, Objdatum As Object, Ddatum As Object
Dim start_time As Date
Dim end_time As Date

lastRow = Range("A" & Rows.Count).End(xlUp).Row

Dim cookie As String
Dim result_cookie As String

start_time = Time
Debug.Print "start_time:" & start_time

For i = 2 To lastRow

    url = "https://www.google.co.in/search?q=" & Cells(i, 3) & "Skipr" & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    XMLHTTP.send

        Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.ResponseText
    Set objResultDiv = html.getelementbyid("rso")

    Set objH3 = objResultDiv.getelementsbytagname("H3")(0)
    Set link = objH3.getelementsbytagname("a")(0)
    Set Objdatum = objResultDiv.getelementsbytagname("span")(2)

    str_text = Replace(link.innerHTML, "<EM>", "")
    str_text = Replace(str_text, "</EM>", "")

    dat_text = Objdatum.innerHTML

    Cells(i, 4) = str_text
    Cells(i, 5) = link.href
    Cells(i, 6) = dat_text

    DoEvents
Next

end_time = Time
Debug.Print "end_time:" & end_time

Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)

End Sub
Jbv
  • 23
  • 3

1 Answers1

0

You need to iterate the collections returned to you by your getelementsbytagname call instead of returning only the first element with the array index (0)

I had a similar project, below are some tips & my approach for your reference, it might assist you in working & maintaining the code in the future:

First, Instead of using CreateObject I prefer to reference the Object Libraries that expose the COM objects, this gives me the ability to browse the functions and properties of each object F2 and gives me code completion (speed & less bugs) within VBA editor (F7 takes you back to code view).

Add Reference Dialog

Giving me documentation and code completion: Code completion

Also, use these const for clarity

'see ready state : https://msdn.microsoft.com/en-us/library/ie/ms534361(v=vs.85).aspx
Const READYSTATE_UNINITIALIZED = 0
Const READYSTATE_LOADING = 1
Const READYSTATE_LOADED = 2
Const READYSTATE_INTERACTIVE = 3
Const READYSTATE_COMPLETE = 4

Finally, using DOMDocument60 to parse the XML into a Document Object Model in memory.

and MSHTML.HTMLDocument to parse the HTML document and iterate the table rows.

Below is code where I iterate all returned rows from a table within a html document embedded in the initial XML document returned by the webserver:

Dim xmlDoc As DOMDocument60
Set xmlDoc = GetXMLDocument("http://www.nbg.ge/rss.php")

'extract publication date
Debug.Print xmlDoc.getElementsByTagName("pubDate")(0).Text

'unwrap html document from CDATA in "//item/description" element
Dim htmlDoc As New MSHTML.HTMLDocument
htmlDoc.body.innerHTML = xmlDoc.SelectNodes("//item/description")(0).Text

'extract table data from html document
Dim tr As IHTMLElement, td As IHTMLElement
For Each tr In htmlDoc.getElementsByTagName("tr")
    For Each td In tr.Children
        'each cell in current row
        Debug.Print "  " & td.innerHTML
    Next td
    'next row
    Debug.Print "-----"
Next tr

Sample Data returned by webservice I was calling:

<rss version="2.0">
<channel>
<title>RSS NBG Currency Rates</title>
<link>https://www.nbg.gov.ge/index.php?m=236&lang=geo</link>
<description>Currency Rates</description>
<language>geo</language>
<copyright>Copyright 2015, NBG</copyright>
<pubDate>Wed, 29 Apr 2015 12:39:50 +0400</pubDate>
<lastBuildDate>Wed, 29 Apr 2015 12:39:50 +0400</lastBuildDate>
<managingEditor>alex@proservice.ge</managingEditor>
<webMaster>alex@proservice.ge</webMaster>
<item>
<title>Currency Rates 2015-04-29</title>
<link>https://www.nbg.gov.ge/index.php?m=236&lang=geo</link>
<description>
<![CDATA[
<table border="0">
    <tr> 
        <td>AED</td> 
        <td>10 არაბეთის გაერთიანებული საამიროების დირჰამი</td> 
        <td>6.2858</td> 
        <td><img src="https://www.nbg.gov.ge/images/green.gif"></td> 
        <td>0.0640</td> </tr><tr> <td>AMD</td> <td>1000 სომხური დრამი</td> 
        <td>4.8676</td> 
        <td><img src="https://www.nbg.gov.ge/images/green.gif"></td> 
        <td>0.0414</td> 
    </tr>
   </table>
]]>
</description>
<pubDate>Wed, 29 Apr 2015 12:39:50 +0400</pubDate>
<guid>
https://www.nbg.gov.ge/index.php?m=236&lang=geo&date=2015-04-29
</guid>
</item>
</channel>
</rss>

and the function that actually gets the document from the webserver (only works if you added the references as shown in above pictures)

Function GetXMLDocument(url As String) As MSXML2.DOMDocument60
    Dim xhr As New XMLHTTP60
    Dim doc As New DOMDocument60
    Dim msg As String

    With xhr
        .Open bstrMethod:="GET", bstrUrl:=url, varAsync:=False

        On Error GoTo SendError
        .send
        On Error GoTo 0

        'http status codes - http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
        '200 = SUCCESS - OK
        If .readyState = READYSTATE_COMPLETE And .Status = 200 Then
            'Debug.Print .responseText
            doc.LoadXML (.responseText)
        Else
            msg = "Error" & vbNewLine & "Ready state: " & .readyState & _
                vbNewLine & "HTTP request status: " & .Status
            GoTo Error
        End If

    End With

    Set GetXMLDocument = doc
Exit Function

SendError:
    'by default access to data source accross internet dissabled
    'go to internet options & under security>custom level>Misc>access data sources accross domains> enable
    'see: http://stackoverflow.com/a/17402920
    MsgBox "Make sure access data sources accross domains is enabled under internet options>security>custom", vbOKOnly, "Could not send request to server"

Error:
    MsgBox msg, vbOKOnly, "Unexpected Error"

End Function
Vincent De Smet
  • 4,859
  • 2
  • 34
  • 41