0

Currently I have extracted 13,000 URLs using the code below. However 3,000 of them came up with URLs from Facebook, Bloomberg, and so on. For these URLs I have been manually searching there names and maybe 1 in 20 have a company URL that the macro missed. So my question is this: Is there a way that I can edit the macro so that if a URL page contains a string value such as "facebook" or "wiki" that it will skip that URL and continue to search for a URL that DOES NOT contain the string value?

Code for how I extract URLs:

Sub XMLHTTP()

    Dim url As String, lastRow As Long
    Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
    Dim start_time As Date
    Dim end_time As Date

    lastRow = Range("A" & Rows.Count).End(xlUp).Row

    Dim cookie As String
    Dim result_cookie As String

    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow

        url = "https://www.google.co.in/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

            Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        Set objH3 = objResultDiv.getelementsbytagname("H3")(0)
        Set link = objH3.getelementsbytagname("a")(0)


        str_text = Replace(link.innerHTML, "<EM>", "")
        str_text = Replace(str_text, "</EM>", "")

        Cells(i, 2) = str_text
        Cells(i, 3) = link.href
        DoEvents
    Next

    end_time = Time
    Debug.Print "end_time:" & end_time

    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub

This is the code I used to filter out URLs based on string values:

Sub badURLs()
    Dim lr As Long ' Declare the variable
    lr = Cells(Rows.Count, 3).End(xlUp).Row ' Set the variable
    ' lr now contains the last used row in column A

    Application.ScreenUpdating = False

    For a = lr To 1 Step -1
        If InStr(1, Cells(a, 3), "bloomberg", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "manta", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "yellowpages", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "yelp", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "snapshot", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "facebook", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "wiki", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "linkedin", vbTextCompare) > 0 _
        Or InStr(1, Cells(a, 3), "hoovers", vbTextCompare) > 0 Then


        'Compares for bloomberg, wiki, or hoovers. Enters loop if value is greater than 0
            With Cells(a, 3)
                .NumberFormat = "General"
                .Value = "NA"
            End With
        End If
    Next a

    Application.ScreenUpdating = True
End Sub

Just to reiterate: I want to know if its possible (and if so how) to filter out URLs in the first macro based on the string values in the second. I'm hoping that this will allow me to have much more accurate URL hits and that I wont have to search 3000 company names manually in hopes that only a few will have a useful URL.

Community
  • 1
  • 1
Brayheart
  • 167
  • 2
  • 16
  • 1
    Consider fully qualifying your Cells references so that it's always clear which sheet they are from; if a different sheet is selected for any reason then your code will look in the wrong place. And try setting `Cells(i, 3) = CStr(link.href)` in the first macro, just in case Excel's doing something daft with the hyperlink – Dave Jun 14 '16 at 17:09
  • What does your second macro do then? Doesn't that filter them? Or have you started the macro but need help finishing? Could you alternatively sort the URLs in Alpha. order? That would put all `Facebook` together, all `bloomberg` together, etc. for quicker deleting/filtering. – BruceWayne Jun 14 '16 at 17:10
  • @Dave If I change the link.href to a string value can I filter the URL search using InStr to tell the macro to select a URL that doesn't contain the string values in the second macro? – Brayheart Jun 14 '16 at 17:26
  • @BruceWayne The issue is that of the 13,000 URLs I have, 3000 of them come up from 3rd party sites and are not the URL I want. I'm trying to make the first macro more accurate in its URL retrieval by telling it to search for a Link/URL that does not contain the string values in the second macro – Brayheart Jun 14 '16 at 17:26

1 Answers1

1

EDIT TO DEMONSTRATE USAGE:

I'm copying your XMLHTTP() code in its entirety below, then I'm adding the user defined function below that to demonstrate how your module is laid out. The change I make really only affects one like: Cells(i, 3) = href. In this case, if href is in the bad list of URLS, nothing will be put in Cells(i, 3). If you need more complex business logic, let us know and we'll try to help.

Sub XMLHTTP()

    Dim url As String, lastRow As Long
    Dim XMLHTTP As Object, html As Object, objResultDiv As Object, objH3 As Object, link As Object
    Dim start_time As Date
    Dim end_time As Date

    lastRow = Range("A" & Rows.Count).End(xlUp).Row

    Dim cookie As String
    Dim result_cookie As String

    start_time = Time
    Debug.Print "start_time:" & start_time

    For i = 2 To lastRow

        url = "https://www.google.co.in/search?q=" & Cells(i, 1) & "&rnd=" & WorksheetFunction.RandBetween(1, 10000)

        Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
        XMLHTTP.Open "GET", url, False
        XMLHTTP.setRequestHeader "Content-Type", "text/xml"
        XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
        XMLHTTP.send

            Set html = CreateObject("htmlfile")
        html.body.innerHTML = XMLHTTP.ResponseText
        Set objResultDiv = html.getelementbyid("rso")
        Set objH3 = objResultDiv.getelementsbytagname("H3")(0)
        Set link = objH3.getelementsbytagname("a")(0)


        str_text = Replace(link.innerHTML, "<EM>", "")
        str_text = Replace(str_text, "</EM>", "")

        Cells(i, 2) = str_text
        If funcBadUrls(Cells(i, 1)) then
            Cells(i, 3) = "" 
        Else    
            Cells(i, 3) = link.href
        End If
        DoEvents
    Next

    end_time = Time
    Debug.Print "end_time:" & end_time

    Debug.Print "done" & "Time taken : " & DateDiff("n", start_time, end_time)
    MsgBox "done" & "Time taken : " & DateDiff("n", start_time, end_time)
End Sub

Function funcBadURLs(sInput as String) as Boolean
    Dim bResult as Boolean
        If InStr(1, sInput, "bloomberg", vbTextCompare) > 0 _
        Or InStr(1, sInput, "manta", vbTextCompare) > 0 _
        Or InStr(1, sInput, "yellowpages", vbTextCompare) > 0 _
        Or InStr(1, sInput, "yelp", vbTextCompare) > 0 _
        Or InStr(1, sInput, "snapshot", vbTextCompare) > 0 _
        Or InStr(1, sInput, "facebook", vbTextCompare) > 0 _
        Or InStr(1, sInput, "wiki", vbTextCompare) > 0 _
        Or InStr(1, sInput, "linkedin", vbTextCompare) > 0 _
        Or InStr(1, sInput, "hoovers", vbTextCompare) > 0 Then
            bResult = True
        Else
            bResult = False
        End If
     funcBadUrls = bResult
End Sub

If I understand correctly, you want to ignore BadUrls in the first subroutine. If so, consider creating a Function based on the second routine that simply returns true if bad, and false if not. Then you can build logic as needed. For example:

Function funcBadURLs(sInput as String) as Boolean
    Dim bResult as Boolean
        If InStr(1, sInput, "bloomberg", vbTextCompare) > 0 _
        Or InStr(1, sInput, "manta", vbTextCompare) > 0 _
        Or InStr(1, sInput, "yellowpages", vbTextCompare) > 0 _
        Or InStr(1, sInput, "yelp", vbTextCompare) > 0 _
        Or InStr(1, sInput, "snapshot", vbTextCompare) > 0 _
        Or InStr(1, sInput, "facebook", vbTextCompare) > 0 _
        Or InStr(1, sInput, "wiki", vbTextCompare) > 0 _
        Or InStr(1, sInput, "linkedin", vbTextCompare) > 0 _
        Or InStr(1, sInput, "hoovers", vbTextCompare) > 0 Then
            bResult = True
        Else
            bResult = False
        End If
     funcBadUrls = bResult
End Sub

To use it:

Sub Test()
    If funcBadUrls("www.bloomberg.com") then
        'Do whatever to skip
    Else
        MsgBox "Success"
    End If
End Sub

Let me know if that helps, or alternatively if I misunderstood your question.

basodre
  • 5,720
  • 1
  • 15
  • 23
  • @user3561814 You are correct I want to add the second subroutine to the first one. However if the result returns true (it finds a URL containing "bloomberg") I want it to keep searching for another URL. If I'm reading this correctly I am to apply my second macro to the first referencing it as a function? In order to use what you posted in your example am I to copy the first part into the second part where you put 'Do whatever to skip'? – Brayheart Jun 14 '16 at 17:40
  • 1
    @Brayheart They essentially serve as two separate subroutines. You can copy the `Function` code and paste it below your current sub `XMLHTTP()` (after the end sub). Then you can reference it like any other built-in `VBA` function. So, if you're reading from cells in the worksheet in a loop, pass the value of the cell to the `funcBadUrls(cells(i, 1)). If it returns true, continue searching. If it returns false, you can record the value and move on in the loop. – basodre Jun 14 '16 at 17:47
  • @user3561814 Thats so cool! thankyou for showing me this! I've put `If funcBadURLs = True Then` under `Cells(i, 3) = link.href` however I don't know how to tell the macro to "Keep Searching". Any ideas? – Brayheart Jun 14 '16 at 18:07
  • @user3561814 If its not too much trouble could you show me how you would reference the function you posted and my 1st macro? For some reason my referencing isn't functioning. – Brayheart Jun 14 '16 at 18:33
  • @Brayheart I'd have no problem demonstrating it, but I'm not sure I follow your code well enough to do so. Do you want to simply skip cells that have a "bad url" in them, and move to the next cell? – basodre Jun 14 '16 at 19:28
  • @user3561814 Yea me too.. I actually got this code from here: http://stackoverflow.com/questions/17495644/using-vba-in-excel-to-google-search-in-ie-and-return-the-hyperlink-of-the-first I just want it to skip over the badURLs when retrieving the URLs from the web-browser. I dont even know if thats possible to be done. But I would love for you to show me how to reference outside functions in VBA I've never done that. The cell that pastes the URL the macro finds is at `Cells(i, 3) = link.href` if that helps.. – Brayheart Jun 14 '16 at 19:43
  • @Brayheart I edited my original post with more info. Let me know if it helps or if you need more help. – basodre Jun 14 '16 at 20:03
  • @user3561814 I would just like to say that its users like you that make SO so amazing. I tried executing everything together but I got an "out of memory" error. When debugging the macro I get a run time error 1004 Application-defined or object-defined error at the IF THEN statement in the function referenced. – Brayheart Jun 14 '16 at 20:28
  • @Brayheart Wow, I totally screwed up in my function, references the original cells instead of the parameter we passed in. I edited my post with the corrected version of the function. Give it a try and report back. – basodre Jun 14 '16 at 20:34
  • @user3561814 It works but it doesn't look like referenced function does anything? I should note that there is a line separating the two I'm not sure if that is of any significance. Also for the function it needs to end with "End Function" :) – Brayheart Jun 14 '16 at 22:19
  • You are correct, it should end with `End Function`. I copied and pasted from your code directly and didn't catch that. Also, updated again. I normally run tests on my side to make sure my code works, but I didn't have sample data. I originally input the value in `Cells(i, 3)` as a parameter to the function, when in fact, that was the destination. The input should have been `Cells(i, 1)`. Does it work better? – basodre Jun 15 '16 at 00:59
  • Sadly it does not. Is there a way I can upload some sample data for you? – Brayheart Jun 15 '16 at 18:20