
I am trying to find a way to get data from yelp.com.

I have a spreadsheet that contains several keywords and locations, and I want to extract data from the Yelp listings that match those keywords and locations.

I have written the following code, but it pulls back junk rather than the exact information I am looking for.

I want to get the business name, address and phone number, but all I am getting is nothing. Could anyone here help me solve this problem?

Sub find()

Dim ie As Object
Dim mDoc As Object
Dim peopleData As String

    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        ' Don't show window
        .Visible = False
        .Navigate "http://www.yelp.com/search?find_desc=boutique&find_loc=New+York%2C+NY&ns=1&ls=3387133dfc25cc99#start=10"

        'Wait until IE is done loading page
        Do While .Busy
            Application.StatusBar = "Downloading information, please wait..."
            DoEvents
        Loop

        ' Make a string from IE content
        Set mDoc = .Document
        peopleData = mDoc.body.innerText
        ActiveSheet.Cells(1, 1).Value = peopleData
    End With

    peopleData = "" 'Nothing
    Set mDoc = Nothing
End Sub

1 Answer


If you right-click in IE and do View Source, it is apparent that the data served on the site is not part of the document's .body.innerText property. This is often the case with dynamically served content, and that approach is really too simplistic for most web scraping.

I opened it in Google Chrome and inspected the elements to get an idea of what I'm really looking for, and how to find it using a DOM/HTML parser; if you want early binding and IntelliSense, you will need to add a reference to Microsoft HTML Object Library.

(Screenshot: inspecting the Yelp listing markup in Chrome's developer tools.)
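
For reference, this is roughly what the early-bound version of the setup would look like. It is only a sketch: it assumes you have set references to Microsoft Internet Controls and Microsoft HTML Object Library (Tools > References), and the Sub name is made up for illustration; the late-bound code further down works without any references.

'Sketch only - requires references to Microsoft Internet Controls
'and Microsoft HTML Object Library
Sub EarlyBoundSketch()
Dim ie As SHDocVw.InternetExplorer
Dim html As MSHTML.HTMLDocument
Dim Listings As MSHTML.IHTMLElementCollection
    Set ie = New SHDocVw.InternetExplorer
    ie.Visible = False
    ie.Navigate "http://www.yelp.com/search?find_desc=boutique&find_loc=New+York%2C+NY&ns=1&ls=3387133dfc25cc99#start=10"
    'Wait for the page before touching the DOM
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop
    Set html = ie.Document
    Set Listings = html.getElementsByTagName("LI")
    Debug.Print Listings.Length    'quick sanity check in the Immediate window
    ie.Quit
    Set ie = Nothing
End Sub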

I think you can get it to return a collection of the <LI> tags, and then check each one for the listing's class name with an If statement inside the loop.

I made some revisions to my original answer; this should print each record in a new cell:

Option Explicit
Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
'(On 64-bit Office this needs to be: Private Declare PtrSafe Sub Sleep ...)
Sub find()
'Uses late binding, or add reference to Microsoft HTML Object Library 
'  and change variable Types to use intellisense
Dim ie As Object 'InternetExplorer.Application
Dim html As Object 'HTMLDocument
Dim Listings As Object 'IHTMLElementCollection
Dim l As Object 'IHTMLElement
Dim r As Long
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = False
        .Navigate "http://www.yelp.com/search?find_desc=boutique&find_loc=New+York%2C+NY&ns=1&ls=3387133dfc25cc99#start=10"
        ' Don't show window
        'Wait until IE is done loading page
        Do While .readyState <> 4
            Application.StatusBar = "Downloading information, Please wait..."
            DoEvents
            Sleep 200
        Loop
        Set html = .Document
    End With
    Set Listings = html.getElementsByTagName("LI") ' ## returns the list
    For Each l In Listings
        '## make sure this list item looks like the listings Div Class:
        '   then, build the string to put in your cell
        If InStr(1, l.innerHTML, "media-block clearfix media-block-large main-attributes") > 0 Then
            Range("A1").Offset(r, 0).Value = l.innerText
            r = r + 1
        End If
    Next

Set html = Nothing
Set ie = Nothing
End Sub
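
If you want the business name, address and phone number in separate columns rather than one blob of text per row, one rough approach is to split each listing's innerText on line breaks and write the pieces across the row. This is a sketch only: the helper name is made up, and which line holds the name, address or phone depends on Yelp's markup at the time you scrape, so check it against the live page.

'Hypothetical helper - writes one listing across columns A, B, C, ...
'Assumes IE separates innerText lines with vbCrLf (verify this).
Sub WriteListing(ByVal l As Object, ByVal r As Long)
    Dim parts() As String
    Dim c As Long
    parts = Split(l.innerText, vbCrLf)
    For c = 0 To UBound(parts)
        ActiveSheet.Range("A1").Offset(r, c).Value = Trim$(parts(c))
    Next c
End Sub

Inside the loop above you would then call `WriteListing l, r` in place of the single `Range("A1").Offset(r, 0).Value = l.innerText` line.
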
David Zemens
    This is a [busy waiting loop](http://stackoverflow.com/a/19019200/1768303), consider adding `Sleep(delay)` inside it, if handling `ie_DocumentComplete` is not possible. – noseratio Oct 11 '13 at 03:01
  • @Noseratio I just noticed that, actually, and changed the loop to `Do While .readyState <> 4`; I also made some more tweaks to the code to make it a more complete solution. – David Zemens Oct 11 '13 at 03:08
  • Hmm, I don't see the change. I meant something like `DoEvents : Sleep(200)` (you'd need to `Declare Sub Sleep Lib "kernel32" Alias "Sleep" (ByVal dwMilliseconds As Long)` first if this is VBA), so it doesn't just eat CPU while waiting. Generally, `DoEvents` can lead to reentrancy problems; here's a very good explanation [why](http://stackoverflow.com/a/5183623/1768303). – noseratio Oct 11 '13 at 03:17
  • +1 for the links, but I will leave that to the OP to work out. I don't do any web scraping professionally; the little bit I tinker with is here on SO, and I've never needed the WinAPI `Sleep` function. The `.ReadyState <> 4` condition should handle the `ie_DocumentComplete`, no? – David Zemens Oct 11 '13 at 03:20
  • It does wait for the IE document to finish loading. It's just that if you open Task Manager, you'll see one of the CPU cores 100% busy cycling in the loop and checking this condition until it's `true`. A short `Sleep` inside the loop would mitigate this CPU burden, but there would still be a loop. On the other hand, subscribing to the `ie_DocumentComplete` event and handling it would make the logic asynchronous and would eliminate the loop. That would, however, require refactoring the whole code. Otherwise, it's a great answer worth up-voting; it's only missing that `Sleep` :) – noseratio Oct 11 '13 at 03:29
  • This particular loop executes so quickly that I don't notice any observable spikes in the CPU, either way, but your persistence and well-sourced links are enough to persuade me that this is probably the *optimal* way of handling it, so I've acquiesced :) – David Zemens Oct 11 '13 at 03:47
  • Glad you have a fast internet connection out there, so the document loads that quick. Say, I'm still on a copper ADSL+ :-) – noseratio Oct 11 '13 at 04:03
  • It must be quick enough. I think the `CreateObject` is slower. The whole macro runs in about 5 seconds. – David Zemens Oct 11 '13 at 04:04
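
For anyone who wants to try the event-driven route described in the comments, here is a rough sketch of a class module that handles `DocumentComplete` instead of polling. The class name is made up, and it requires a reference to Microsoft Internet Controls; the scraping logic itself would then move into (or be triggered from) the event handler, so there is no wait loop at all.

'Class module "clsBrowser" (hypothetical name) - sketch only.
'Requires a reference to Microsoft Internet Controls.
Private WithEvents ie As SHDocVw.InternetExplorer
Private mLoaded As Boolean

Public Sub Browse(ByVal url As String)
    mLoaded = False
    Set ie = New SHDocVw.InternetExplorer
    ie.Visible = False
    ie.Navigate url
End Sub

Public Property Get Loaded() As Boolean
    Loaded = mLoaded
End Property

Public Property Get Doc() As Object
    Set Doc = ie.Document
End Property

Private Sub ie_DocumentComplete(ByVal pDisp As Object, URL As Variant)
    'Fires for the top-level document and for each frame; the page is
    'fully loaded when pDisp is the top-level browser object itself.
    If pDisp Is ie Then mLoaded = True
End Sub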