
I have a list of approximately 20,000 actors and want to check which of them have a German Wikipedia page.

I found code with which you can search for URLs via Google and get the first result copied into Excel:
Using VBA in Excel to Google Search in IE and return the hyperlink of the first result

I tried to restrict the search to the German Wikipedia by having Google search German pages only, e.g. "site:de.wikipedia.org intitle:johnny depp".

This works for well-known actors.

I get an error when I search for an actor who does not have a page of his own:

"Error 91: Object variable or with block variable not set"

How can I build a workaround that skips an actor who has no page of his own and continues with the next one in the list?

Or maybe you have a simpler solution.

Sample File

Sub XMLHTTP()
    Dim url As String, lastRow As Long
    Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
    Dim start_time As Date
    Dim end_time As Date

    lastRow = Range("A" & Rows.Count).End(xlUp).Row
    
    Dim cookie As String
    Dim result_cookie As String
    
    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow

        url = "https://www.google.de/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

        Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        ' Error 91 is raised on the next line when Google returns no results and objResultDiv is Nothing
        Set objH3 = objResultDiv.getelementsbytagname("H3")(0)
        Set link = objH3.getelementsbytagname("a")(0)

        str_text = Replace(link.innerHTML, "<EM>", "")
        str_text = Replace(str_text, "</EM>", "")

        Cells(i, 2) = str_text
        Cells(i, 3) = link.href
        DoEvents
    Next
    
    end_time = Time
    Debug.Print "end_time:" & end_time
    
    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub
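
For illustration, a minimal sketch of how the site-restricted query could be plugged into the loop above, assuming column A holds only the actor's name (the `query` variable is illustrative):

    Dim query As String
    ' Build the restricted query and replace spaces so it survives inside the URL
    query = "site:de.wikipedia.org intitle:" & Cells(i, 1).Value
    url = "https://www.google.de/search?q=" & Replace(query, " ", "+") & _
          "&rnd=" & WorksheetFunction.RandBetween(1, 10000)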
Stgrwld

2 Answers


Check whether the objResultDiv element is found; if it is, proceed further, otherwise write "Not Found" to the cells.

You may try something like this...

Sub XMLHTTP()
    Dim url As String, lastRow As Long
    Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
    Dim start_time As Date
    Dim end_time As Date
    Dim i As Long
    Dim str_text As String
    Dim cookie As String
    Dim result_cookie As String

    lastRow = Range("A" & Rows.Count).End(xlUp).Row

    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow

        url = "https://www.google.de/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

        If XMLHTTP.Status = 200 Then
            Set html = CreateObject("htmlfile")
            html.body.innerHTML = XMLHTTP.ResponseText
            Set objResultDiv = html.getelementbyid("rso")

            If Not objResultDiv Is Nothing Then
                Set objH3 = objResultDiv.getelementsbytagname("H3")(0)
                Set link = objH3.getelementsbytagname("a")(0)

                str_text = Replace(link.innerHTML, "<EM>", "")
                str_text = Replace(str_text, "</EM>", "")

                Cells(i, 2) = str_text
                Cells(i, 3) = link.href
                DoEvents
            Else
                Cells(i, 2) = "Not Found"
                Cells(i, 3) = "Not Found"
            End If
        Else
            Cells(i, 2) = "Not Found"
            Cells(i, 3) = "Not Found"
        End If
    Next

    end_time = Time
    Debug.Print "end_time:" & end_time

    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub
Subodh Tiwari sktneer
  • Thank you for the quick answer. If I try this code, it just gives "Not Found" for every actor, whether he has his own page or not. Do you have any suggestions? – Stgrwld Apr 28 '18 at 18:11
  • I tested it on your file and it worked without an issue. Here is the [File with code](https://drive.google.com/open?id=1VxY0dIpGOLUWUpVoTMCGIZtl9uAot6l_pvvV9fnRlJE) – Subodh Tiwari sktneer Apr 28 '18 at 18:50
  • [Here is the file again](https://docs.google.com/spreadsheets/d/1Q-Hxelh1wKT_6HuHFegHUyN2uZiOfM9E_JCE2a3zjTw/edit?usp=sharing) – Subodh Tiwari sktneer Apr 28 '18 at 18:55
  • Thank you very much!! This works fine for me now! I think the problem yesterday was Google's IP restriction after several search inquiries. But I guess there is no easy way around this? – Stgrwld Apr 29 '18 at 08:45
  • You may use a counter to count the queries and use `Application.Wait` for 5-10 seconds once the counter reaches, say, 50, then reset the counter back to 0 so that the next time it reaches 50 it will pause again for 5-10 seconds, and so on (see the sketch after these comments). – Subodh Tiwari sktneer Apr 29 '18 at 08:52
  • OK, thank you. Another idea I had was to implement a lag of e.g. 0.5 sec after each query, but I guess that would be the same concept. Do you know an easy way to build this into the existing code? Thank you so much for your help! – Stgrwld Apr 29 '18 at 11:20
  • I think you should open a new question with your new requirement. – Subodh Tiwari sktneer Apr 29 '18 at 11:47
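
A minimal sketch of the throttling idea from the comment above, dropped into the request loop of the answer's code; the counter name and the 50-query / 10-second thresholds are illustrative:

    Dim queryCount As Long

    For i = 2 To lastRow
        ' ... build the URL, send the request and write the result as above ...

        queryCount = queryCount + 1
        If queryCount >= 50 Then
            Application.Wait Now + TimeValue("0:00:10")   ' pause for ~10 seconds
            queryCount = 0                                ' reset and continue
        End If
    Next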

It is sometimes hard to scrape information from Google using XMLHTTP, ServerXMLHTTP, or WinHTTP requests. Even if you try with a proxy, Google can easily detect you as a bot, send you to a CAPTCHA page, and your attempt will fail. The safer approach in this case is to pilot IE. Try the way below instead; if you have IE9 or later, the `.querySelector()` calls used in the scraper will work.

Sub ScrapeGoogle()
    Dim IE As New InternetExplorer, HTML As HTMLDocument
    Dim cel As Range, URL$, post As Object

    For Each cel In Range("A2:A" & Cells(Rows.Count, 1).End(xlUp).row)
        URL = "https://www.google.de/search?q=" & Replace(cel.Value, " ", "%20")

        With IE
            .Visible = True
            .navigate URL
            While .Busy = True Or .readyState <> 4: DoEvents: Wend
            Set HTML = .document

            If Not HTML.querySelector(".rc h3.r a") Is Nothing Then
                Set post = HTML.querySelector(".rc h3.r a")
                cel(1, 2) = post.innerText
                cel(1, 3) = post.getAttribute("href")
            Else
                cel(1, 2) = "Nothing found"
                cel(1, 3) = "Sorry dear"
            End If
        End With
    Next cel
    IE.Quit
End Sub

References to add in the VBA editor via Tools > References (a late-bound alternative is sketched below the list):

Microsoft Internet Controls
Microsoft HTML Object Library
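
If adding those references is not an option, a late-bound variant should also work (this is a sketch, not part of the original answer, and it still requires Internet Explorer to be installed; with late binding the library constants are unavailable, so the numeric `readyState` value 4 is used directly):

    Dim IE As Object, HTML As Object
    Set IE = CreateObject("InternetExplorer.Application")
    IE.Visible = True
    IE.navigate "https://www.google.de/search?q=johnny+depp"
    Do While IE.Busy Or IE.readyState <> 4    ' 4 = READYSTATE_COMPLETE
        DoEvents
    Loop
    Set HTML = IE.document
    ' HTML.querySelector(...) can then be used exactly as in the code above
    IE.Quit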
MITHU
  • Thank you very much. Is it essential to have Internet Explorer 9 installed to run this code? At the moment I just have Microsoft Edge, and I get "Run-time error 429: ActiveX component can't create object". – Stgrwld Apr 29 '18 at 08:21
  • Yep, `IE9` or anything later is required for this script to work; otherwise `.querySelector()` won't be supported. – MITHU Apr 29 '18 at 19:13