2

Using Excel VBA, i have to scrape some data from this website.

Since the relevant website objects dont contain an id, I cannot use HTML.Document.GetElementById.

However, I noticed that the relevant information is always stored in a <div>-section like the following:

<div style="padding:7px 12px">Basler Versicherung AG &#214;zmen</div>

Question: Is it possible to construct a RegExp that, probably in a Loop, returns the contents inside <div style="padding:7px 12px"> and the next </div>?

What I have so far is the complete InnerHtml of the container, obviously I need to add some code to loop over the yet-to-be-constructed RegExp.

Private Function GetInnerHTML(url As String) As String
    Dim i As Long
    Dim Doc As Object
    Dim objElement As Object
    Dim objCollection As Object

On Error GoTo catch
   'Internet Explorer Object is already assigned
   With ie
        .Navigate url
        While .Busy
            DoEvents
        Wend
        GetInnerHTML = .document.getelementbyId("cphContent_sectionCoreProperties").innerHTML
    End With
    Exit Function
catch:
    GetInnerHTML = Err.Number & " " & Err.Description
End Function
Martin Dreher
  • 1,514
  • 2
  • 12
  • 22
  • 1
    Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – arco444 May 25 '18 at 12:16
  • What, aside from the title, leads you to that conclusion? – Martin Dreher May 25 '18 at 12:26
  • Have you read the first sentence of the answer? – arco444 May 25 '18 at 12:32
  • It would help if you showed some examples of expected output. Are you after "Die Eingabe darf höchstens 255 Zeichen lang sein" ? – QHarr May 25 '18 at 12:44
  • 1
    @MartinDreher Take a look [here](https://stackoverflow.com/a/35782811/2165759) and [here](https://stackoverflow.com/a/41538937/2165759). – omegastripes May 25 '18 at 12:53
  • @arco444 read it before posting my question, wondering how it qualifies as the highest rated answer, since "you can't" instead of "you very likely shouldn't" is misleading. Anyhow, `(\
    ).*?(?=\<\/div\>)` with an added `lookbehind` works as intended.
    – Martin Dreher May 28 '18 at 06:03
  • @omegastripes upvoted both your answers as they match my question. The `.NextSibling` approach (below, Ryan and Sim) seems to good not to use tho ;) – Martin Dreher May 28 '18 at 06:14

2 Answers2

2

I don't think you need Regular expressions to find the content on the page. You can use the relative positions of the elements to find the content I believe you are after.

Code

Option Explicit

Public Sub GetContent()
    Dim URL     As String: URL = "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649"
    Dim IE      As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Labels  As Object
    Dim Label   As Variant
    Dim Values  As Variant: ReDim Values(0 To 1, 0 To 5000)
    Dim i       As Long

    With IE
        .Navigate URL
        .Visible = False

        'Load the page
        Do Until IE.busy = False And IE.readystate = 4
            DoEvents
        Loop

        'Find all labels in the table
        Set Labels = IE.document.getElementByID("cphContent_pnlDetails").getElementsByTagName("label")

        'Iterate the labels, then find the divs relative to these
        For Each Label In Labels
            Values(0, i) = Label.InnerText
            Values(1, i) = Label.NextSibling.Children(0).InnerText
            i = i + 1
        Next

    End With

    'Dump the values to Excel
    ReDim Preserve Values(0 To 1, 0 To i - 1)
    ThisWorkbook.Sheets(1).Range("A1:B" & i) = WorksheetFunction.Transpose(Values)

    'Close IE
    IE.Quit
End Sub
Ryan Wildry
  • 5,612
  • 1
  • 15
  • 35
  • `.NextSibling.Childen(0)`... it's really that easy! Thanks a lot Ryan! I am however inclined to use *SIM*s `XMLHTTP` approach for performance reasons (IE was the only possibility I was aware of). – Martin Dreher May 28 '18 at 06:23
  • Kept it consistent with your given code, a Web Request approach is a better method to be sure. – Ryan Wildry May 28 '18 at 13:21
  • Ye, I noticed that you probably kept most of my code on purpose, which is a good move for improving step-by-step – Martin Dreher May 29 '18 at 08:13
2

Another way you can achieve the same using XMLHTTP request method. Give it a go:

Sub Fetch_Data()
    Dim S$, I&

    With New XMLHTTP60
        .Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
        .send
        S = .responseText
    End With

    With New HTMLDocument
        .body.innerHTML = S
        With .querySelectorAll("#cphContent_sectionCoreProperties label[id^='cphContent_ct']")
            For I = 0 To .Length - 1
                Cells(I + 1, 1) = .Item(I).innerText
                Cells(I + 1, 2) = .Item(I).NextSibling.FirstChild.innerText
            Next I
        End With
    End With
End Sub

Reference to add to the library before executing the above script:

Microsoft HTML Object Library
Microsoft XML, V6.0
SIM
  • 21,997
  • 5
  • 37
  • 109
  • Very similar to *Ryan Wildry*s excellent idea! Using a `XMLHTTP` request seems boots performance by quite a bit. Thanks for this additional piece of info! – Martin Dreher May 28 '18 at 06:26