VBA webscraper - Return InnerHTML with regex

Question

Using Excel VBA, i have to scrape some data from this website.

Since the relevant website objects dont contain an id, I cannot use HTML.Document.GetElementById.

However, I noticed that the relevant information is always stored in a <div>-section like the following:

<div style="padding:7px 12px">Basler Versicherung AG &#214;zmen</div>

Question: Is it possible to construct a RegExp that, probably in a Loop, returns the contents inside <div style="padding:7px 12px"> and the next </div>?

What I have so far is the complete InnerHtml of the container, obviously I need to add some code to loop over the yet-to-be-constructed RegExp.

Private Function GetInnerHTML(url As String) As String
    Dim i As Long
    Dim Doc As Object
    Dim objElement As Object
    Dim objCollection As Object

On Error GoTo catch
   'Internet Explorer Object is already assigned
   With ie
        .Navigate url
        While .Busy
            DoEvents
        Wend
        GetInnerHTML = .document.getelementbyId("cphContent_sectionCoreProperties").innerHTML
    End With
    Exit Function
catch:
    GetInnerHTML = Err.Number & " " & Err.Description
End Function

Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — arco444, May 25 '18 at 12:16
It would help if you showed some examples of expected output. Are you after "Die Eingabe darf höchstens 255 Zeichen lang sein" ? — QHarr, May 25 '18 at 12:44
@MartinDreher Take a look [here](https://stackoverflow.com/a/35782811/2165759) and [here](https://stackoverflow.com/a/41538937/2165759). — omegastripes, May 25 '18 at 12:53
@arco444 read it before posting my question, wondering how it qualifies as the highest rated answer, since "you can't" instead of "you very likely shouldn't" is misleading. Anyhow, `(\
).*?(?=\<\/div\>)` with an added `lookbehind` works as intended. — Martin Dreher, May 28 '18 at 06:03
@omegastripes upvoted both your answers as they match my question. The `.NextSibling` approach (below, Ryan and Sim) seems to good not to use tho ;) — Martin Dreher, May 28 '18 at 06:14

score 2 · Answer 1 · answered May 25 '18 at 12:40

I don't think you need Regular expressions to find the content on the page. You can use the relative positions of the elements to find the content I believe you are after.

Code

Option Explicit

Public Sub GetContent()
    Dim URL     As String: URL = "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649"
    Dim IE      As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Labels  As Object
    Dim Label   As Variant
    Dim Values  As Variant: ReDim Values(0 To 1, 0 To 5000)
    Dim i       As Long

    With IE
        .Navigate URL
        .Visible = False

        'Load the page
        Do Until IE.busy = False And IE.readystate = 4
            DoEvents
        Loop

        'Find all labels in the table
        Set Labels = IE.document.getElementByID("cphContent_pnlDetails").getElementsByTagName("label")

        'Iterate the labels, then find the divs relative to these
        For Each Label In Labels
            Values(0, i) = Label.InnerText
            Values(1, i) = Label.NextSibling.Children(0).InnerText
            i = i + 1
        Next

    End With

    'Dump the values to Excel
    ReDim Preserve Values(0 To 1, 0 To i - 1)
    ThisWorkbook.Sheets(1).Range("A1:B" & i) = WorksheetFunction.Transpose(Values)

    'Close IE
    IE.Quit
End Sub

`.NextSibling.Childen(0)`... it's really that easy! Thanks a lot Ryan! I am however inclined to use *SIM*s `XMLHTTP` approach for performance reasons (IE was the only possibility I was aware of). — Martin Dreher, May 28 '18 at 06:23
Kept it consistent with your given code, a Web Request approach is a better method to be sure. — Ryan Wildry, May 28 '18 at 13:21
Ye, I noticed that you probably kept most of my code on purpose, which is a good move for improving step-by-step — Martin Dreher, May 29 '18 at 08:13

score 2 · Accepted Answer · answered May 25 '18 at 18:57

Another way you can achieve the same using XMLHTTP request method. Give it a go:

Sub Fetch_Data()
    Dim S$, I&

    With New XMLHTTP60
        .Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
        .send
        S = .responseText
    End With

    With New HTMLDocument
        .body.innerHTML = S
        With .querySelectorAll("#cphContent_sectionCoreProperties label[id^='cphContent_ct']")
            For I = 0 To .Length - 1
                Cells(I + 1, 1) = .Item(I).innerText
                Cells(I + 1, 2) = .Item(I).NextSibling.FirstChild.innerText
            Next I
        End With
    End With
End Sub

Reference to add to the library before executing the above script:

Microsoft HTML Object Library
Microsoft XML, V6.0

Very similar to *Ryan Wildry*s excellent idea! Using a `XMLHTTP` request seems boots performance by quite a bit. Thanks for this additional piece of info! — Martin Dreher, May 28 '18 at 06:26

VBA webscraper - Return InnerHTML with regex

2 Answers2