0

Before I ask my question, I'm an amateur coder with basically no meaningful experience beyond VBA in ms office apps (I know - noob!)

I'm trying to create a web scraper using VBA to import data into excel and as per my comments in the below extract of code, the best I've been able to find on this is was in the winning answer to this question.

Below, I'm using investing.com as an example but in reality my project will be across multiple sites and will feed into a matrices which will be updating daily and self cannibalizing as events expire - For this reason I'd rather front-up the workload on the code side to make the inputs on an ongoing basis as minimal as possible (for me).

With that in mind, can I ask if there's a way to do any of the following (brace yourself, this will be cringe-worthy basic knowledge for some):

  1. Is there a way in which I can and navigate to a url and run a for each loop on every table on that page (without have a known id for any)? this is to speed up my code as much as it's to minimise my inputs as there'll be quite a bit of data to be updated and I was planning on putting a 2 minute looping trigger on the refresh.

  2. Instead of doing what I've been doing below, is it possible to reference a table, rather than a row, and do something along the lines of Cells(2,5).value to return the value within row 1, column 4? (assuming that both the array indexing starts at 0 in both dimensions?) Further to that, my first column (my primary key in some ways) may not be in the same order on all sources so is there a way in which I could do the equivalent to Columns("A:A").Find(What:=[Primary key], After:=Cells(1, 1), LookIn:=xlValues, LookAt:=xlWhole, SearchOrder:=xlByRows, SearchDirection:=xlNext, MatchCase:=True, SearchFormat:=False).Row to find what row within the table relates to the even I'm looking for?

Code :

Sub Scraper()
Dim appIE, allRowOfData As Object

' As per https://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba

Set appIE = CreateObject("internetexplorer.application")

With appIE
   .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
   .Visible = False
End With

Do While appIE.Busy
    Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second befor trying again
Loop

Set allRowOfData = appIE.document.getElementById("pair_8907") 
'tr id="[ID of row within table]"
Dim myValue As String: myValue = allRowOfData.Cells(8).innerHTML 
'The 8 is the column number of the table 
'(note: column numbers start at 0 so the 9th column should have "8" entered here

Set appIE = Nothing

Range("A1").Value = myValue

End Sub
Community
  • 1
  • 1
Jeremy
  • 1,337
  • 3
  • 12
  • 26
  • Why am I getting done voted? – Jeremy Jan 25 '17 at 09:57
  • @R3uK Thanks for the edit - I couldn't format it to code for some reason – Jeremy Jan 25 '17 at 10:03
  • You'll need a line of normal text between lists and code! ;) Apparently you were down voted because your question is too broad, you may need to prioritize! ;) And ask other question later or elsewhere (ie, 3. isn't very clear and could be ask on SuperUser) – R3uK Jan 25 '17 at 10:07
  • 1
    Thanks @R3uK - I'll remove 3 and will just learn when I attempt to scrape! – Jeremy Jan 25 '17 at 10:20

1 Answers1

4

If you want to use Excel functions to navigate the tables why not dump the tables first onto a worksheet this code works for me

Option Explicit

Sub Scraper()
    Dim appIE As Object

    ' As per http://stackoverflow.com/questions/27066963/scraping-data-from-website-using-vba

    Set appIE = CreateObject("internetexplorer.application")

    With appIE
       .Navigate "http://uk.investing.com/rates-bonds/financial-futures" 'Sample page
       .Visible = True
    End With

    Do While appIE.Busy
        DoEvents
        Application.Wait (Now + TimeValue("0:00:01")) 'If page not open, wait a second befor trying again
    Loop

    'Debug.Print TypeName(appIE.document)

    Dim doc As Object 'MSHTML.HTMLDocument
    Set doc = appIE.document

    '* appIE busy is good but you need to wait for the whole document to completely load and initialise so use this
    While doc.readyState <> "complete"
        DoEvents
    Wend

    '* we can select all the tables because they share the same CSS class name
    Dim tablesSelectedByClass As Object 'MSHTML.HTMLElementCollection
    Set tablesSelectedByClass = doc.getElementsByClassName("genTbl")

    '* you can change this, it was just convenient for me to add sheets to my workbook
    Dim shNewResults As Excel.Worksheet
    Set shNewResults = ThisWorkbook.Worksheets.Add

    Dim lRowCursor As Long  '* this controls pasting down the sheet
    lRowCursor = 1

    Dim lTableIndexLoop As Long
    For lTableIndexLoop = 0 To tablesSelectedByClass.Length - 1

        Dim tableLoop As Object 'MSHTML.HTMLTable
        Set tableLoop = tablesSelectedByClass.Item(lTableIndexLoop)

        If LenB(tableLoop.ID) > 0 Then  '* there are some extra nonsense tables, this subselects

            Dim sParentColumn As String, objParentColumn As Object ' MSHTML.HTMLSemanticElement
            Set objParentColumn = FindMyColumn(tableLoop, sParentColumn) '* need to understand is table on left hand or right hand side

            Dim vHeader As Variant: vHeader = Empty
            If sParentColumn = "leftColumn" Then
                '* tables on the left have a preceding H3 element with the table's description
                Dim objH3Headers As Object
                Set objH3Headers = objParentColumn.getElementsByTagName("H3")
                vHeader = objH3Headers.Item(lTableIndexLoop).innerText
            Else
                '* tables on the right have a hidden attribute we can use
                vHeader = tableLoop.Attributes.Item("data-gae").Value
                If Len(vHeader) > 3 Then
                    vHeader = Mid$(vHeader, 4)
                    Mid$(vHeader, 1, 1) = Chr(Asc(Mid$(vHeader, 1, 1)) - 32)
                End If
            End If

            '* tables on the right do not have column headers
            Dim bHasColumnHeaders As Boolean
            bHasColumnHeaders = (tableLoop.ChildNodes.Length = 2)

            Dim vTableCells() As Variant   '* this will be our table data container which we will paste in one go
            Dim lRowCount As Long: lRowCount = 0
            Dim lColumnCount As Long: lColumnCount = 0
            Dim lDataHeadersSectionIdx As Long: lDataHeadersSectionIdx = 0
            Dim objColumnHeaders As Object: Set objColumnHeaders = Nothing

            If bHasColumnHeaders Then

                Set objColumnHeaders = tableLoop.ChildNodes.Item(0).ChildNodes.Item(0)

                lRowCount = lRowCount + 1

                lDataHeadersSectionIdx = 1
            Else
                lDataHeadersSectionIdx = 0
            End If

            Dim objDataRows As Object 'MSHTML.HTMLElementCollection
            Set objDataRows = tableLoop.ChildNodes.Item(lDataHeadersSectionIdx).ChildNodes
            lColumnCount = objDataRows.Item(0).ChildNodes.Length

            lRowCount = lRowCount + objDataRows.Length

            ReDim vTableCells(1 To lRowCount, 1 To lColumnCount) As Variant

            '* we have them get the column headers
            Dim lColLoop As Long
            If bHasColumnHeaders Then
                For lColLoop = 1 To lColumnCount
                    vTableCells(1, lColLoop) = objColumnHeaders.ChildNodes.Item(lColLoop - 1).innerText
                Next
            End If

            '* get the data cells
            Dim lRowLoop As Long
            For lRowLoop = 1 To lRowCount - VBA.IIf(bHasColumnHeaders, 1, 0)
                For lColLoop = 1 To lColumnCount
                    vTableCells(lRowLoop + VBA.IIf(bHasColumnHeaders, 1, 0), lColLoop) = objDataRows.Item(lRowLoop - 1).ChildNodes.Item(lColLoop - 1).innerText
                Next
            Next

            '* paste our table description
            shNewResults.Cells(lRowCursor, 1).Value2 = vHeader
            lRowCursor = lRowCursor + 1

            '* paste our table data
            shNewResults.Cells(lRowCursor, 1).Resize(lRowCount, lColumnCount).Value2 = vTableCells
            lRowCursor = lRowCursor + lRowCount + 1
        End If

    Next

End Sub

Function FindMyColumn(ByVal node As Object, ByRef psColumn As String) As Object
    '* this code ascends the DOM looking for "column" in the id of each node
    While InStr(1, node.ID, "column", vbTextCompare) = 0 And Not node.ParentNode Is Nothing
        DoEvents
        Set node = node.ParentNode
    Wend
    If InStr(1, node.ID, "column", vbTextCompare) > 0 Then
        Set FindMyColumn = node
        psColumn = CStr(node.ID)
    End If


End Function

By the way, if you trade a lot the brokers get rich and you get poor, brokerage charges really impact in long run.

S Meaden
  • 8,050
  • 3
  • 34
  • 65
  • You are both a Gentleman (or lady - can't tell!) and a scholar -Thank you for this! I work in financial services so aware of brokers fees etc. -The above was just an example site :) – Jeremy Jan 28 '17 at 20:21