0

I want to import restaurant data like Restaurant name, phone number & website to excel but unfortunately I am getting only one page (first page) however I want data from any range which I define like page-1 to page-3 or page-2 to page-5 in separate sheets for each page. Sample output file is attached of what output I am getting for the time being. enter image description here

    Sub Webscraping()
        'Declaration
     Dim ie As InternetExplorer
     Dim ht As HTMLDocument
    'Initialization
     Set ie = New InternetExplorer
     ie.Visible = True

'Open a url
ie.navigate ("https://www.yellowpages.com/atlanta-ga/restaurants")

'Set ht = ie.document
'MsgBox ht.getElementsByClassName("ot_lrp_bname_free_center")

'Alternative Approach for wait

Do Until ie.readyState = READYSTATE_COMPLETE
    DoEvents
Loop

'Initialize the document

Set ht = ie.document

'Set elems = ht.getElementsByClassName("list-title")
Set elems = ht.getElementsByClassName("business-name")
'Set elems = ht.getElementsByClassName("website-lrp icon-link ot_lrp_website_text_free_center")

i = 1
For Each elem In elems
    Sheet1.Cells(i, 1).Value = elem.innerText
    i = i + 1

    'Debug.Print (elem.innerText)
Next

Set elems = ht.getElementsByClassName("phone primary")

i = 1
For Each elem In elems
    Sheet1.Cells(i, 2).Value = elem.innerText
    i = i + 1

   'Debug.Print (elem.innerText)
Next
Set elems = ht.getElementsByClassName("links")
i = 1
For Each elem In elems

    Set link = elem.ChildNodes.Item(0)
    Sheet1.Cells(i, 3).Value = link.href
    i = i + 1    
Next

'Set internetdata = ie.document
'Set div_result = internetdata.getElementById("ctl00_gvMain_ctl03_hlTitle")
'Set header_links = div_result.getElementsByTagName("a")
'For Each h In header_links
'Set link = h.ChildNodes.Item(0)
'Worksheets("Stocks").Cells(Range("L" & Rows.Count).End(xlUp).Row + 1, 12) = link.href
 End Sub

This is the work which have been done but struggling to get the required ouput

chris neilsen
  • 52,446
  • 10
  • 84
  • 123
M Faizan Farooq
  • 359
  • 1
  • 4
  • 14

2 Answers2

2

The pages are concatenated onto end of url. I would use xhr issues requests in a loop over given page range and regex out the json which contains the required info (it is in one of the script tags). This method is very fast and more than offsets use of regex. I also re-use objects where possible.

I use jsonconverter.bas to handle the json and parse out required info (there is a lot more info in the json including the reviews). After downloading the .bas and adding to a module called JsonConverter in your project, you need to go VBE > Tools > References > Add reference to Microsoft Scripting Runtime.

Helper functions are used to test whether page to write out to already exists or needs creating, as well as one to write out json results to an array and dump the array out in one go to sheet (efficiency gain). The structure is left so it is easy to extend the retrieved info if more info is desired e.g. the review.

There may be some work to do on ensuring works for pages that don't exist. I have simply used status code of response at present to filter these out.


NOTES:

As a sanity check I would use InternetExplorer to go to page 1 and extract the total results count. I would divide that by results per page (currently 30) to calculate total pages. This would give me lbound and ubound values (min and max for possible pages). Then switch to xmlhttp to actually retrieve. See additional helper function at end.


Code:

Option Explicit
Public Sub GetRestuarantInfo()
    Dim s As String, re As Object, p As String, page As Long, r As String, json As Object
    Const START_PAGE As Long = 2
    Const END_PAGE As Long = 4
    Const RESULTS_PER_PAGE As Long = 30

    p = "\[{""@context"".*?\]"
    Set re = CreateObject("VBScript.RegExp")

    Application.ScreenUpdating = False

    With CreateObject("MSXML2.XMLHTTP")

        For page = START_PAGE To END_PAGE
            .Open "GET", "https://www.yellowpages.com/atlanta-ga/restaurants?page=" & page, False
            .send
            If .Status = 200 Then
                s = .responseText
                r = GetValue(re, s, p)
                If r <> "Not Found" Then
                    Set json = JsonConverter.ParseJson(r)
                    WriteOutResults page, RESULTS_PER_PAGE, json
                End If
            End If
        Next
    End With
    Application.ScreenUpdating = True
End Sub
Public Sub WriteOutResults(ByVal page As Long, ByVal RESULTS_PER_PAGE As Long, ByVal json As Object)
    Dim sheetName As String, results(), r As Long, headers(), ws As Worksheet
    ReDim results(1 To RESULTS_PER_PAGE, 1 To 3)

    sheetName = "page" & page
    headers = Array("Name", "Website", "Tel")
    If Not WorksheetExists(sheetName) Then
        Set ws = ThisWorkbook.Worksheets.Add
        ws.Name = sheetName
    Else
        ThisWorkbook.Worksheets(sheetName).Cells.ClearContents
    End If
    With ws
        Dim review As Object
        For Each review In json  'collection of dictionaries
            r = r + 1
            results(r, 1) = review("name")
            results(r, 2) = review("url")
            results(r, 3) = review("telephone")
        Next
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With
End Sub

Public Function GetValue(ByVal re As Object, inputString As String, ByVal pattern As String) As String
'https://regex101.com/r/M9oRON/1
    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
        If .Test(inputString) Then
            GetValue = .Execute(inputString)(0)
        Else
            GetValue = "Not found"
        End If
    End With
End Function

Public Function WorksheetExists(ByVal sName As String) As Boolean  '@Rory https://stackoverflow.com/a/28473714/6241235
    WorksheetExists = Evaluate("ISREF('" & sName & "'!A1)")
End Function

Helper function to return number of pages

'VBE > Tools > References: Microsoft Internet Controls
Public Function GetNumberOfPages(ByVal RESULTS_PER_PAGE As Long) As Variant
    Dim ie As Object, totalResults As Long
    On Error GoTo errhand
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = False
        .Navigate2 "https://www.yellowpages.com/atlanta-ga/restaurants?page=1"

        While .Busy Or .readyState < 4: DoEvents: Wend

        With .document
            totalResults = Replace$(Replace$(.querySelector(".pagination  p").innerText, "We found", vbNullString), "results", vbNullString)
            GetNumberOfPages = totalResults / RESULTS_PER_PAGE
            ie.Quit
            Exit Function
        End With
    End With
errhand:
    If Err.Number <> 0 Then
        GetNumberOfPages = CVErr(xlErrNA)
    End If
End Function

Regex explanation:

Try it here.

enter image description here

QHarr
  • 83,427
  • 12
  • 54
  • 101
  • Wouldn't it be better to click the "Next" button and avoid the magic numbers `START_PAGE`, `END_PAGE` etc? This is what the popular cralwers are doing. – Vityata May 31 '19 at 12:43
  • The OP said he/she wants to define ranges. Otherwise, yes I would totally agree. My first thought was to use IE to grab the result count from page one and then divide by 30 to get max number of pages and move onto to do rest with xmlhttp. – QHarr May 31 '19 at 12:43
  • I guess the OP does not want what s/he needs. I guess s/he needs simply everything, thus this is hidden in "however I want data from any range which I define like page-1 to page-3 or page-2 to page-5 in separate sheets for each page." – Vityata May 31 '19 at 12:46
  • Thanks alot, i need to see last page to know how many pages i need to define to get total data, can you edit it in a way that if pages are from 1 to 31 & we write 1 to 50 in source code then it will only import 31 pages as of know it is giving error in this scenario & also want to bring url on front sheet1 cell A1 instead of editing it in source code. Thanks – M Faizan Farooq May 31 '19 at 18:34
  • Will do. That is why I added the other function that returns the total page count. – QHarr May 31 '19 at 18:49
  • And can I get all sheets in a single sheet. – M Faizan Farooq Jun 01 '19 at 11:21
0

The only way to do this with VBA is to check for the existence of a "Next" button and to click it, if it is there:

enter image description here

This is the HTML for it:

<a class="next ajax-page" href="/atlanta-ga/restaurants?page=2" data-page="2" data-analytics="{&quot;click_id&quot;:132}" data-remote="true" data-impressed="1">Next</a>

This is not "science fiction" to be done with VBA, however, there are commercial RPA solutions, which give "out of the box" functionalities exactly for this task - UiPath, AutomationAnywhere, BluePrism. Python's "beautiful soup" would do a rather good job as well.

Vityata
  • 42,633
  • 8
  • 55
  • 100
  • Is there is no way to do this using code get all pages data on one sheet at least. – M Faizan Farooq May 31 '19 at 12:08
  • @MFaizanFarooq - you can "click" through the code, if the "Next" is there. Take a look here - https://stackoverflow.com/questions/28153744/use-excel-vba-to-click-on-a-button-in-internet-explorer-when-the-button-has-no – Vityata May 31 '19 at 12:10
  • Can you please edit this code, as I am new to this excel vba therefore couldn't be able to pick your point. – M Faizan Farooq May 31 '19 at 12:18