0

I am trying to scrape data from a website without any luck. i manage to navigate through Elements but I haven't managed to get the information from the last Elements. Below is my code, any help would be appreciated.

Option Explicit

Sub Download_Historical_Data()
    
    Dim IE As InternetExplorer, doc As HTMLDocument
    Dim All_Matches, Match
    Dim All_Champions, Champion
    
    'Open Browser and download data
    Set IE = New InternetExplorer

    With IE
        .Visible = True
        .Navigate ("https://www.scorespro.com/soccer/results/")

        While .Busy Or .readyState < 4: DoEvents: Wend

        Set doc = .document

    End With

    Set All_Champions = doc.getElementById("matches-data").getElementsByClassName("compgrp")
        
        For Each Champion In All_Champions
            
            Set All_Matches = Champion.getElementsByTagName("table")
            
            For Each Match In All_Matches
            
                If Left(Match.className, 12) = "blocks gteam" Then
                    With Match
                        'All the info
                    End With
                        
                End If
                
            Next Match
            
        Next Champion

    IE.Quit
    Set IE = Nothing
               
End Sub

Sample on 9/8/19: enter image description here

Sample on 7/8/19: enter image description here

Output: enter image description here

The reason i have use as sample 2 different days is because there is a game with penalties and i want to include this as well.

Error 1004
  • 7,877
  • 3
  • 23
  • 46
  • 1
    So what is exactly is missing please? Did you manually set up those Excel views and now want the code to generate that output? You say _i manage to navigate through Elements but I haven't managed to get the information from the last Elements_ but there is nothing defined as _Elements_ I'm guessing you mean you can navigate through html elements but not get required info.... but I see no code that attempts to write out info. – QHarr Aug 09 '19 at 15:29
  • Exactly what I was about to ask ? Code doesn't give that kind of output. Do you wish to get that type of output ? – Mikku Aug 09 '19 at 15:33

2 Answers2

3

You don't need to automate a browser. If you inspect the network traffic when selecting a date you will see an XHR request for the info. You can use those details (in fact I shorten to just the required url params) to retrieve the page content.

The info is contained in table tag elements. The champion is in tables with class name blockBar, otherwise the info is for the row info as seen on page. In order to leverage querySelector (which is a method of HTMLDocument) to select the sub table level elements, by class name, for individual tables, I stick the individual table html into a surrogate html document variable; I then have access to querySelector again and so can write nice flexible/descriptive css selectors to match on elements.

The columns in your output all have nice descriptive class names in the XHR response, so you can use those to determine which column to write to. As score info may risk losing formatting on output I use a Select Case statement, to test for those css selectors, and append a single quote to preserve formatting on output.

I choose, for efficiency, to store all results in an array and write out in one go.

Option Explicit
Public Sub GetMatchInfo()
    Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
    Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long

    Set html = New HTMLDocument
    Set html2 = New HTMLDocument
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
    cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.scorespro.com/soccer/ajax-calendar.php?mode=results&date=2019-08-07", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim tables As Object, selector As String

    Set tables = html.querySelectorAll("table")

    ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)

    For i = 0 To tables.Length - 1
        If tables.item(i).className = "blockBar" Then
            champion = tables.item(i).innerText
        Else
            r = r + 1
            html2.body.innerHTML = tables.item(i).outerHTML
            On Error Resume Next
            For j = LBound(cssSelectors) To UBound(cssSelectors)
                selector = cssSelectors(j)
                Select Case selector
                Case ".score_link", ".halftime", ".after_pen"
                    results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
                Case "champion"
                    results(r, j + 1) = champion
                Case Else
                    results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
                End Select
            Next
            On Error GoTo 0
        End If
    Next
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Example sample output:

enter image description here


Using IE

Option Explicit
Public Sub GetMatchInfo()
    Dim headers(), results(), r As Long, c As Long, ws As Worksheet, i As Long
    Dim champion As String, html As HTMLDocument, html2 As HTMLDocument, cssSelectors(), j As Long

    Set html = New HTMLDocument
    Set html2 = New HTMLDocument
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    headers = Array("Date", "Time", "Status", "Champion", "Home Team", "Full Time Score", "Away Team", "Half Time", "Penalties Score")
    cssSelectors = Array(".kick_t_dt", ".kick_t_ko", ".status", "champion", ".home", ".score_link", ".away", ".halftime", ".after_pen")

    With CreateObject("InternetExplorer.Application")
        .Navigate2 "https://www.scorespro.com/soccer/results/"
        While .Busy Or .readyState <> 4: DoEvents: Wend
        Application.Wait Now + TimeSerial(0, 0, 2)
        html.body.innerHTML = .document.body.innerHTML
        .Quit
    End With

    Dim tables As Object, selector As String

    Set tables = html.querySelectorAll("table")

    ReDim results(1 To tables.Length, 1 To UBound(headers) + 1)

    For i = 0 To tables.Length - 1
        If tables.item(i).className = "blockBar" Then
            champion = tables.item(i).innerText
        Else
            r = r + 1
            html2.body.innerHTML = tables.item(i).outerHTML
            On Error Resume Next
            For j = LBound(cssSelectors) To UBound(cssSelectors)
                selector = cssSelectors(j)
                Select Case selector
                Case ".score_link", ".halftime", ".after_pen"
                    results(r, j + 1) = "'" & html2.querySelector(cssSelectors(j)).innerText
                Case "champion"
                    results(r, j + 1) = champion
                Case Else
                    results(r, j + 1) = html2.querySelector(cssSelectors(j)).innerText
                End Select
            Next
            On Error GoTo 0
        End If
    Next
     ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
    ws.Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • thanks for your time and effort. As i can see this scraping method is different for my own. Since i vary rarely use VBA for web scraping, i will carefully review tha answer to learn this method. Thanks a lot!!!!! – Error 1004 Aug 09 '19 at 16:48
  • 1
    @QHarr, how do you know all this stuff? I have read through your multiple posts. I always enjoy reading your comments and getting new insights from you. I'm not sure how you figured out all this stuff, but I'm curious to know, if you want to share a few learning tips. Thanks. – ASH Aug 09 '19 at 19:40
  • @QHarr i m still on a vacation so i do not manage to go through the code. However i try to remove the brackets enclosing the `Half Time` using `Replace(Replace(html2.querySelector(cssSelectors(j)).innerText, "(", ""), ")", "")`. What is your opinion? – Error 1004 Aug 21 '19 at 07:00
  • @QHarr i have another question. Is it possible to explain how you get the following part of you code because i m trying to open another page with this method i dont manage so far? `"https://www.scorespro.com/soccer/ajax-calendar.php?mode=results&date=2019-08-07"` – Error 1004 Sep 04 '19 at 17:02
  • I examined the web traffic for the page via network tab in dev tools See [1](https://stackoverflow.com/questions/56277464/how-to-import-a-table-from-web-page-with-div-class-to-excel/56279841#56279841) and [2](https://stackoverflow.com/a/56924071/6241235) – QHarr Sep 04 '19 at 17:11
  • @QHarr i have i question for you. I have noticed that array 'results' at last positions are empty. I try to identified why is that but i dont manage to find anything. Could you please help? – Error 1004 Jun 30 '20 at 18:24
  • I won't test as not sure scraping site is allowed but check if there are less headers than listed. Is an actual expected result missing? Otherwise, when redim results don't + 1 – QHarr Jun 30 '20 at 18:53
  • i think the problem occurs when the array increase from 1 To tables.Length. – Error 1004 Jun 30 '20 at 20:45
  • Depends is it last row and columns that are empty? tables.length should be ok unless table has an empty row – QHarr Jun 30 '20 at 20:47
  • many last rows & columns are empty. if think tables.length has more lines – Error 1004 Jun 30 '20 at 21:13
  • 1
    Check in the browser and look for empty rows using elements tab via F12 – QHarr Jun 30 '20 at 21:48
0

I wrote this in WSL (web scraping language) but basically you can edit the json to add any other fields (assuming all the football games). Once you got all the data, then you can either have it emailed to you or your web server.

GOTO www.scorespro.com/soccer/results/ >> 
EXTRACT {'time': '.kick_t', 'status':'.status',
         'home':'.home.uc', 'score':'.score', 'away':'.away', 'match':'a'} IN table tr

Explanation: it goes to that score page, and then pulls time, status, home, score, away fields for each table row via table tr and finally the match field which will come from the header bar table row. It will look like {'time':undefined, ...., 'match':'Armenia: Premier League'} along with other table row game schedules like {'time':'2019/8/21' ,..., 'match':undefined}. Just merge the JSON objects afterwards.

user299709
  • 4,922
  • 10
  • 56
  • 88