How to select then when scraping data from web page

Question

I have been learning about scraping data from web pages using Excel and VBA. I cam across a bit of a barrier with one data source so changed to https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures. The problem I am coming up against is in the table id "flight-status-board-arrivals" there is a and . I can pull the header data fine but when I try bulling the body data the loop ends. I have tried changing to .children(1) to test and debugging using length which returns 0. This indicates there is nothing in the object but I cant understand why. I hope I have covered everything here and any help would be possible. Also, I know this could be achieved using another language however as I am learning VBA I thought it best to learn using VBa before delving into a new language.`

Sub GrabWebData()

Dim ie As InternetExplorer 'refer to the running copy of internet explorer
Dim html As HTMLDocument 'refer to the HTML document returned
Dim ele As Object
Dim y As Integer
Dim fSht As Worksheet

Set fSht = Sheets("Sheet1")
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"

'wait until IE is done loading page
Do While ie.READYSTATE <> READYSTATE_COMPLETE
Application.StatusBar = "Loading Flight Times"
DoEvents
Loop

y = 1

Debug.Print ie.document.getElementById("flight-status-board-arrivals").Children(1) _
.getElementsByTagName("td").Length


For Each ele In ie.document.getElementById("flight-status-board- arrivals").Children(1) _
.getElementsByTagName("tr")
Debug.Print ele.textContent
fSht.Range("A" & y).Value = ele.Children(0).textContent
'On Error GoTo skip1:
fSht.Range("B" & y).Value = ele.Children(1).textContent
'On Error GoTo skip1:
fSht.Range("C" & y).Value = ele.Children(2).textContent
'On Error GoTo skip1:
fSht.Range("D" & y).Value = ele.Children(3).textContent
'On Error GoTo skip1:
fSht.Range("E" & y).Value = ele.Children(4).textContent
'On Error GoTo skip1:
fSht.Cells.WrapText = False
fSht.Rows.AutoFit
fSht.Columns.AutoFit
'skip1:
y = y + 1
Next

'Rows(2).Select
'Selection.Delete shift:=xlUp

End Sub`

You are not getting anything except for the headers just because the data within `tbody` load very late or never load in IE. I waited for 5 minutes to see what happens, but could not find the data getting loaded. In such cases `selenium` might be a good way to go with. — SIM, Jun 23 '18 at 13:22
@SIM thanks for that, I will check it out. I was thinking that the reason it doesn't load is because the data on the site in tbody is updating regularly but I thought there should still be data there to load. Could that be the case? — TheRunner83, Jun 23 '18 at 13:36
If you can't see any visible text out there (within tbody) using IE, how can the script fetch that? However, I will check it out too!! — SIM, Jun 23 '18 at 13:44
@SIM do you mean on the actual page using IE? If so the text/data shows on the page. If you check the “td” in tbody and watch the “tr” I think it is, these seem to scramble on the refresh. That’s why I thought it could be the issue. Look forward to hear your findings — TheRunner83, Jun 23 '18 at 14:11
Check out [the link](https://www.dropbox.com/s/y4jyxfbgemntu3j/Untitled.jpg?dl=0) to understand what I meant. — SIM, Jun 23 '18 at 15:00

QHarr · Answer 1 · 2018-07-28T06:55:19.130

1

The following should get you started. It is uses selenium basic. After installing, you need to add a reference to selenium type library and HTML Object library.

Was in a hurry so I will come back and refine later.

Option Explicit

Public Sub GetInfo()
    Dim d As WebDriver, hTable As HTMLTable, html As HTMLDocument, doc As WebElement, headers(), b As Object
    headers = Array("Flight Details", "Status", "Scheduled Time", "Airline Flight", "Origin", "Terminal", "Status")
    Set d = New ChromeDriver
    Const URL = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Application.ScreenUpdating = False
    With d
        .Start "Chrome"
        .Get URL
        Set html = New HTMLDocument
        Set b = .FindElementById("flight-status-board-arrivals") '<== Only used to take advantage of implicit waits in Selenium. This is a TODO improve.
        html.body.innerHTML = .findElementByXPath("//body").Attribute("innerHTML")

        Set hTable = html.getElementById("flight-status-board-arrivals")

        WriteTable hTable, headers

        .Quit
        Application.ScreenUpdating = True
    End With
End Sub

Public Sub WriteTable(ByVal hTable As HTMLTable, ByRef headers As Variant, Optional ByVal startRow As Long = 1, Optional ByVal ws As Worksheet)
    If ws Is Nothing Then Set ws = ActiveSheet

    Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long, tBody As Object
    r = startRow
    With ws
        Set tRow = hTable.getElementsByTagName("tr") 'HTMLTableRow
        For Each tr In tRow
            Set tCell = tr.getElementsByTagName("td")
            For Each td In tCell                 'DispHTMLElementCollection
                .Cells(r, c).Value = td.innerText 'HTMLTableCell
                c = c + 1
            Next td
            r = r + 1:  c = 1
        Next tr
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    End With
End Sub

Current sample snapshot of webpage:

Current sample snapshot of code output:

Note:

When inspecting the page additional info appears (front 2 columns before visible columns shown in screenshot above):

Code captures all of these.

I was interested in seeing if I could transfer innerHTML from webElements, so as to use the properties of .document, or other HTML DOM objects. I found the way to do it in an answer by @Nerijus.

edited Jul 28 '18 at 06:55

answered Jun 23 '18 at 15:19

QHarr

83,427
12
54
101

It's a very nice piece of code. Genuinely awesome!! – SIM Jun 23 '18 at 16:35
Thanks. Needs a little tweaking. Maybe you can sort how one creates a new html table? I’d rather load the html into that than the entire document HTML. Was in a hurry so didn’t get a chance to look into it. If you solve please post as answer or whichever method you use! – QHarr Jun 23 '18 at 16:36
The way you load the html is mindblowing. I just forgot to mention that. It's always a pleasure to learn new things. – SIM Jun 23 '18 at 16:39
@QHarr thanks for this. One thing though, after getting Selenium and chrome driver, I am struggling to get past the Set d = New ChromeDriver line. Any ideas? I will continue to search but any tips would be great. – TheRunner83 Jun 23 '18 at 19:34
What is the problem? Do you get an error message of some sort? What is it? And is it on that line? – QHarr Jun 23 '18 at 19:39
@SIM Seems I would have to do something like document.createElement("TABLE") to have HTMLTable so no real point trying for HTMLTable object. I could // to the id of the table though and load just that like I did with HTMLBody. Not sure it would be a massive improvement..... – QHarr Jun 23 '18 at 20:16
@QHarr the error I get is an automation error on the line Set d = New ChromeDriver. I will install the newest versions of ChromeDriver and see if that works. – TheRunner83 Jun 23 '18 at 21:53
Yes. This can occur if you have an older version of chromedriver. – QHarr Jun 24 '18 at 04:02
Check out my update answer @QHarr. It might be useful for future usage. Actually, I found it very nice. Thanks. – SIM Jul 02 '18 at 09:55
@SIM Will do! Thanks for letting me know. – QHarr Jul 02 '18 at 09:56

SIM · Accepted Answer · 2018-07-02T09:53:45.250

1

Try the following code to get the data from that table. I've defined Explicit Wait within the script so that it will wait until the tabular data has been available in that webpage. Although I would strongly suggest you to follow the way QHarr has already shown, you can give this a shot additionally. The script will run headlessly so you won't see any browser. However, the only problem is that you may encounter a stale element error somewhere within the operation, not always though.

This is how the script looks like:

Sub FetchData()
    Const link As String = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Dim posts As Object, post As Object, elem As Object, R&, C&

    With New ChromeDriver
        .AddArgument "--headless"
        .get link
        Set posts = .FindElementByCss("#flight-status-board-arrivals tbody tr", Timeout:=30000)
        For Each post In .FindElementsByCss("#flight-status-board-arrivals tr")
            For Each elem In post.FindElementsByCss("th,td")
                C = C + 1: Cells(R + 1, C) = elem.Text
            Next elem
            C = 0: R = R + 1
        Next post
    End With
End Sub

To make the execution time a lot faster and revert back to HTMLDocument parser reusing the .PageSource, you should try like below.

Sub FetchData()
    Const link As String = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Dim posts As Object, post As Object, elem As Object, R&, C&
    Dim Html As New HTMLDocument

    With New ChromeDriver
        .AddArgument "--headless"
        .get link
        Set posts = .FindElementByCss("#flight-status-board-arrivals tbody tr", timeout:=30000)
        Html.body.innerHTML = .PageSource  'this is how you can go
    End With

    For Each post In Html.getElementById("flight-status-board-arrivals").Rows
        For Each elem In post.Cells
            C = C + 1: Cells(R + 1, C) = elem.innerText
        Next elem
        C = 0: R = R + 1
    Next post
End Sub

Reference to add to the library before execution:

Selenium Type Library
Microsoft HTML Object Library

edited Jul 02 '18 at 09:53

answered Jun 23 '18 at 20:56

SIM

21,997
5
37
109

thanks for this. I just tried to run this but it still returned n automation error. I included the Selenium library and have the chromedriver installed. I stepped through using F8 and it got to with new chromedriver and returned the error. I am using excel 2010 if that any help or has any bearing on the issue. – TheRunner83 Jun 23 '18 at 21:31
Can you run any script with the version of selenium you have installed, I meant have you checked any url to play with? – SIM Jun 23 '18 at 21:34
I encountered the same issue for the first time. However, the easy fix is to find any latest version of `chromedriver.exe`, then download it and replace it with the one currently available in your installation folder (program files > selenium basic). – SIM Jun 23 '18 at 21:39
Check out the edit. I've kicked out the hardcoded delay and applied `Explicit Wait` in place. – SIM Jun 24 '18 at 13:57
Still having issues with the new chrome line. Will have a look and see what I can sort out and try the code – TheRunner83 Jun 24 '18 at 14:00
I have tried installing the ChromeDriver again and still experiencing the runtime error of automation error. I have included in the original post the references I have included in excel. Any ideas of where to go next? – TheRunner83 Jun 24 '18 at 18:28
Did you try replacing `chromedriver.exe` which is meant to be downloaded from another source. – SIM Jun 24 '18 at 18:34
S I have installed Selenium Basic and then installed the ChromeDriver but unable to find the file path so I can replace the .exe file. Checked in application data where a lot of posts advise to go but there is nothing there apart from a windows folder. Odd. I also installed the IDE hoping that might make a difference but it hasn't. – TheRunner83 Jun 24 '18 at 19:07
Found the folder and replaced file but still getting the error. Starting the run out of ideas – TheRunner83 Jun 24 '18 at 19:26
I tried few of that `exes` to get the chromedriver run successfully. – SIM Jun 24 '18 at 19:30
Try this. If it is possible for you to download. I'm using this now. [Download from here](https://www.dropbox.com/s/u632niniwh6sfd2/chromedriver.exe?dl=0) – SIM Jun 24 '18 at 19:34
Added the driver you provided to the folder and once again returned the error (which I have now added to the original post) – TheRunner83 Jun 24 '18 at 19:42
I have managed to get this to work. .net framework had not been installed. Just tested by opening and closing a chrome browser window and worked a dream. Thank you for your help with this. – TheRunner83 Jun 25 '18 at 21:32
Html.body.innerHTML = .PageSource +1 nice Didn't know about the headless bit either though makes sense. – QHarr Jul 02 '18 at 09:57

How to select then when scraping data from web page

2 Answers2