2

My system is working with Windows 10 64bit and Office 2016

I am a bit familiar with Excel VBA, and I am trying to learn web scraping using VBA. Unfortunately, there is little information on in-depth digging on web data.

The website I want to scrape data from is bizbuysell.com from the seller's offers such as

Sample URL 1 Sample URL 2

There is a section that starts with the headline Detailed Information The HTML code is:

Detailed Information

<h3>Detailed Information</h3>
Location:
Pinellas County, FL
Inventory:
Included in asking price
Employees:
8 FT

I want to scrape data from this section.

The problem is that there are some 18 data labels and their respective values possible, but only those are shown for which the seller has entered data.

My idea was to search for all possible data labels and if they are not available then next data field

I tried it with the following code, but Obviously I made a mistake

For Each ele In doc.getElementsByClassName("listingProfile_details")
txt = ele.parentElement.innerText

If Left(txt, 8) = "Location" Then
    location = Trim(Mid(txt, InStrRev(txt, ":") + 1))
ElseIf Left(txt, 4) = "Inventory" Then
    inventory = Trim(Mid(txt, InStrRev(txt, ":") + 1))
.
.
.
End If

Next ele

I hope that someone can show me the correct VBA code to check for all 18 possible data labels and the respective data

Thank you so much! Tony

alphaService
  • 131
  • 1
  • 14
  • `Unfortunately, there is little information on in-depth digging on web data.` - citation needed. There is soooooooooo much info out there. Finding the better quality stuff, especially if new, is the problem. – QHarr Jun 24 '21 at 05:27
  • Is there a listing that has all 18 labels? – QHarr Jun 24 '21 at 05:57
  • QHarr - yes that is the problem – alphaService Jun 24 '21 at 11:22
  • Qharr - I did not find any listing which has data for all 18 labels yet. I think I need a code routine that will check for the availability of all labels and scrape the data of those that are available for a specific listing – alphaService Jun 24 '21 at 11:24
  • Can you [edit] the question to include the list of 18 labels, as text, as they appear on the website? – QHarr Jun 24 '21 at 11:35
  • Location Type Inventory Real Estate Building SF Building Status Lease Expiration Employees Furniture, Fixtures, & Equipment (FF&E) Facilities Competition Growth & Expansion Financing Support & Training Reason for Selling Franchise Home-Based Business Website – alphaService Jun 24 '21 at 19:22
  • Location | Type | Inventory | Real Estate | Building SF | Building Status | Lease Expiration | Employees | Furniture, Fixtures, & Equipment | FF&E) | | Facilities | Competition | Growth & Expansion | Financing | Support & Training | Reason for Selling | Franchise | Home-Based | Business Website Location | Type | Inventory | Real Estate | Building SF | Building Status | Lease Expiration | Employees | Furniture, Fixtures, & Equipment | FF&E) | | Facilities | Competition | Growth & Expansion | Financing | Support & Training | Reason for Selling | Franchise – alphaService Jun 24 '21 at 19:25
  • 1
    Home-Based | Business Website – alphaService Jun 24 '21 at 19:25

1 Answers1

1

One way it to gather a nodeList of the dt/dd elements and loop it with a step 2 so you can access the label at n indices and the value at n + 1.

To handle differing numbers of labels being present, you can initialise a fresh dictionary, with all the possible labels as keys, and the associated values as vbNullString, during the loop over urls, such that for each new XHR request you get a new dictionary ready to populate with the labels that are found. By using .Exists test, you only update the values for keys (labels) that are found at the current URI.

You can store all results in an array to write out to the sheet in one go at end.

There are lots of additional notes within the code.


Option Explicit

Public Sub GetDetailedBizBuySellInfo()
    Dim http As Object, urls() As Variant
    Dim html As MSHTML.HTMLDocument              'VBE > Tools > References > Microsoft HTML Object Library
    
    urls = Array("https://www.bizbuysell.com/Business-Opportunity/covid-friendly-commercial-cleaning-est-30-years-100k-net/1753433/?d=L2Zsb3JpZGEvaGlsbHNib3JvdWdoLWNvdW50eS1idXNpbmVzc2VzLWZvci1zYWxlLzI/cT1hVEk5T0RFc01qQXNNekFzTnpnbWJtRndQV1UlM0Q=", _
                 "https://www.bizbuysell.com/Business-Opportunity/Established-Cleaning-Business-Tampa-St-Pete/1849521/?utm_source=bizbuysell&utm_medium=emailsite&utm_campaign=shtmlbot&utm_content=headline")
                 
    Set http = CreateObject("MSXML2.XMLHTTP")
    Set html = New MSHTML.HTMLDocument
    
    Dim url As Long, results() As Variant
             
    ReDim results(1 To UBound(urls) + 1, 1 To 19) 'size the final output array. _
                                                  There will be the number of urls as row count, the number of labels as column count + 1 to store the url itself. You need to update the list of labels below. See GetBlankDetailedInformationDictionary   
    With http
    
        For url = LBound(urls) To UBound(urls)   'loop url list
            
            .Open "Get", urls(url), False
            .setRequestHeader "User-Agent", "Mozilla/5.0"
            .send
            
            html.body.innerHTML = .responseText
            
            Dim currentDetailedInformation As Scripting.Dictionary 'VBE > Tools > References > Microsoft Scripting Runtime
            
            Set currentDetailedInformation = GetCurrentDetailedInfo(html) 'use retrieved html to return a dictionary with key as dt > strong e.g.Location; value as dd e.g. Tampa, FL
   
            AddCurrentDetailedInfoToResults results, currentDetailedInformation, url, urls(url) 'url + 1 (zero indexed) will keep track of current row number to add to results
        Next
        
    End With
    
    With ActiveSheet                             'better to update with explicit sheet/be careful not to overwrite data already in a sheet
        .Cells(1, 1).Resize(1, UBound(results, 2)) = currentDetailedInformation.keys ' write out headers
        .Cells(1, UBound(results, 2)) = "Url"
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results ' write out results
    End With
End Sub

Public Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String)
    
    Dim key As Variant, currentColumn As Long

    For Each key In currentDetailedInformation.keys
        currentColumn = currentColumn + 1        'increase column count to update results array with
        results(url + 1, currentColumn) = currentDetailedInformation(key)
    Next
    results(url + 1, currentColumn + 1) = currentUrl
End Sub

Public Function GetCurrentDetailedInfo(ByVal html As MSHTML.HTMLDocument) As Scripting.Dictionary
    ' Gathers a list of all the relevant dd, dt nodes within the passed in HTMLDocument.
    ' Requests a new blank dictionary whose keys are the labels (child strong element of dt tag)
    'Updates blank dictionary, per key, where present, with dd value in a loop of step 2 as list is strong, dd, strong, dd etc.....
    
    Dim updatedDictionary As Scripting.Dictionary, listOfLabelsAndValues As MSHTML.IHTMLDOMChildrenCollection
    
    Set updatedDictionary = GetBlankDetailedInformationDictionary
    'Css pattern to match the appropriate nodes
    Set listOfLabelsAndValues = html.querySelectorAll("#ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dt > strong, #ctl00_ctl00_Content_ContentPlaceHolder1_wideProfile_listingDetails_dlDetailedInformation dd")

    Dim currentIndex As Long
    
    For currentIndex = 0 To listOfLabelsAndValues.length - 2 Step 2 'nodeList is 0 index based
    
        'On Error Resume Next                     'key (label) may not be present for current html document _
        i.e. url so ignore errors when attempting to update blank dictionary via dt > strong matching on key. If label not found then value = vbNullString
        Dim key As String, value As String
        
        key = Trim$(listOfLabelsAndValues.Item(currentIndex).innerText)
        value = Trim$(listOfLabelsAndValues.Item(currentIndex + 1).innerText) 'as we are looping every 2 indices 0,2,4 ....
        If updatedDictionary.Exists(key) Then updatedDictionary(key) = value
        
        'On Error GoTo 0
    Next
    
    Set GetCurrentDetailedInfo = updatedDictionary ' return updated dictionary

End Function

Public Function GetBlankDetailedInformationDictionary() As Scripting.Dictionary

    Dim blankDictionary As Scripting.Dictionary, keys() As Variant, key As Long

    Set blankDictionary = New Scripting.Dictionary

    '' TODO Note: you would add in all 18 labels into array below.
    keys = Array("Location:", "Type:", "Inventory:", "Real Estate:", "Building SF:", _
                 "Building Status:", "Lease Expiration:", "Employees:", "Furniture, Fixtures, & Equipment (FF&E):", _
                 "Facilities:", "Competition:", "Growth & Expansion:", "Financing:", "Support & Training:", _
                 "Reason for Selling:", "Franchise:", "Home-Based:", "Business Website:")

    For key = LBound(keys) To UBound(keys)
        blankDictionary(keys(key)) = vbNullString 'add blank entry to dictionary for each label
    Next
    
    Set GetBlankDetailedInformationDictionary = blankDictionary
End Function
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • QHarr, thank you so much! I have to study your solution - it is completely new for me and I am sure I can learn a lot! – alphaService Jun 24 '21 at 23:52
  • Sub AddCurrentDetailedInfoToResults(ByRef results As Variant, ByVal currentDetailedInformation As Scripting.Dictionary, ByVal url As Long, ByVal currentUrl As String) ---> an error message was triggered: User-defined type not defined – alphaService Jun 24 '21 at 23:54
  • you need to add the two references indicated within the code. Microsoft HTML Object Library and Microsoft Scripting Runtime Alt +F11 > Tools > References – QHarr Jun 25 '21 at 01:06
  • I can answer any questions that you have – QHarr Jun 25 '21 at 01:07
  • 1
    Wow, wow, wow - the code works great and very fast. Now I have to study what you coded. Thank you so much! – alphaService Jun 25 '21 at 01:18
  • In your code, you insert the to be scraped urls in an array (urls = Array) - in my Excel sheet I have the urls in column A:A, each url in a new row. how could I get these urls into your array definition? I hope you do not mind me asking these silly questions - but I try to learn to code it myself. – alphaService Jun 25 '21 at 14:13
  • you read the column into a 2D array then transpose that into a 1D array. Example: https://stackoverflow.com/a/56984003/6241235 – QHarr Jun 25 '21 at 15:31
  • I tried, but I get an error "9" with line "redim" Set ws = ThisWorkbook.Worksheets("data1") With ws lastRow = .Cells(.Rows.Count, "A").End(xlUp).Row urls = Application.Transpose(.Range("A2:A" & lastRow).value) End With urls = Array(strArray) Set http = CreateObject("MSXML2.XMLHTTP") Set html = New MSHTML.HTMLDocument Dim url As Long ReDim results(1 To UBound(urls)) – alphaService Jun 25 '21 at 15:54
  • Please pastebin.com to share a link to formatted code as I can't read that. – QHarr Jun 25 '21 at 15:59
  • I didn't know about this website - I signed up and copied my code there. Here is the link to the code https://pastebin.com/2Grijzej – alphaService Jun 25 '21 at 16:27
  • Remove this line `urls = Array(strArray)` – QHarr Jun 25 '21 at 16:46
  • I did so an now the next Sub shows an index error for results – alphaService Jun 25 '21 at 18:48
  • QHarr, your code is so beautiful and highly effective, because it is running the scraping task so fast. May I ask you if you could add some lines of code to scrape the headline of the sales offer and the Asking Price and add it after the data for the "Labels" has been scraped? That would make the code perfect! Thanks for all the support you have given me already! – alphaService Jun 28 '21 at 14:40
  • Hi, Where is the latest code please? Pastebin it for me. – QHarr Jun 28 '21 at 15:44
  • Thank you so much! I have uploaded the latest code to https://pastebin.com/RfpbkapT – alphaService Jun 28 '21 at 17:15
  • QHarr, did you get my last comment and could you downlad the latest code from pastbin? – alphaService Jun 29 '21 at 15:24
  • I did and will look but busy with work at mo. – QHarr Jun 29 '21 at 18:18
  • QHarr, did you have time to look at the script if you can add headline and Asking Price to the data to be scraped? Thanks! – alphaService Jul 01 '21 at 11:31
  • Not yet as busy with work and helping others. You are next. Maybe this evening or Saturday. – QHarr Jul 01 '21 at 11:52
  • Works great - you are a genius! Thank you very much! – alphaService Jul 01 '21 at 19:37
  • For learning: I have to admit that I could not find the code with which you manage to scrape the Headline and the asking price? Please let me know how you did it – alphaService Jul 01 '21 at 19:38
  • I did more of a patch. https://pastebin.com/VeMy3CSg Do ctrl + F in the doc and search for the word "CHANGED" to see each new/altered line of code. – QHarr Jul 01 '21 at 19:56
  • Wow - that is a really elegant solution! I would have never come up with such beautiful code. I can only repeat myself - this code is a masterpiece! Not only that, but I learned a lot! Thank you – alphaService Jul 01 '21 at 20:49
  • Pleasure. Happy web-scraping. – QHarr Jul 01 '21 at 21:55