
In Excel VBA to load a website and get it into a sheet I have been using the following:

Dim IE As Object  
Set IE = CreateObject("InternetExplorer.Application")  
IE.Navigate "https://www.wsj.com/market-data/bonds/treasuries"

And then I can copy and paste it into my Excel sheet. But this website no longer works with IE11, and Excel VBA insists on using IE11 even though it is about to be deprecated.

Is there another way? I have also looked at:

  • Selenium: but it seems to be pretty much obsolete for VBA (not updated since 2016) and I couldn’t get it to work with Edge or Firefox in VBA anyway.

  • AutoIt: I got it to write the website’s HTML to a TXT file ($oHTTP = ObjCreate("winhttp.winhttprequest.5.1") ; $oHTTP.Open("GET", $URL1, False) ; $oHTTP.Send() ; $oReceived = $oHTTP.ResponseText ; FileWrite($file, $oReceived)), but the file’s contents are far from convenient, as it is endless HTML. It would take a fair amount of VBA code to sort through the mess, which probably means it won’t be reliable going forward. Also, given how slow my large workbook already is, it would take literally several minutes to copy the website data into a sheet element by element.

Surely there must be an easy way to load the site, or just the table within the site, into an Excel sheet? This must be a well trodden path, but after much googling I can’t find an easy solution that actually works.

I have 5-10 web pages being loaded into this workbook, and it seems to be a full-time job keeping the whole thing working!! Any thoughts/help very much appreciated!!!

drb01
  • SeleniumBasic is old, that’s right, but it works with the newest WebDrivers. See YasserKhalil’s answer at the following link for how to install SeleniumBasic and the WebDriver: https://stackoverflow.com/questions/57216623/using-google-chrome-in-selenium-vba-installation-steps – Zwenn Jul 03 '21 at 12:57
  • What data are you looking for in the website given in your question? I tried with XMLHTTP, and although the HTML may seem endless, if you are looking for the Treasury Notes and Bonds or Treasury Bills data, it exists as a JSON string within the HTML document that you can extract and parse. – Raymond Wu Jul 03 '21 at 13:12
  • (1) I did download the latest drivers and Selenium still wouldn’t work, even with Chrome. But TBH I am trying to keep Chrome off my PC altogether and use Brave / Firefox / Edge. Guess I’d better start from scratch again, but is there a good, current explanation somewhere of how to use Brave/Firefox/Edge? And is Selenium going to keep working for a good few years yet? – drb01 Jul 03 '21 at 17:40
  • (2) Yes, I want the treasury data. I have got the HTML into a VBA string (600 kB!). Is there an easy way to get the text into a sheet, or to parse it? I can parse it with some VBA code, but that just seems like a real hack rather than an elegant solution. (3) So many people must want to do this sort of thing, and the built-in Excel functionality seems virtually obsolete. There must be a more elegant solution?! – drb01 Jul 03 '21 at 17:40
  • @Zwenn It works with some of the newest webdrivers. I think FF and Opera and PhantomJS might be no-no's now. – QHarr Jul 03 '21 at 21:56
  • You are probably right. I am old school – I know C backwards, and a reasonable amount of C++. I need to modernise but there is soooo much stuff out there I don’t even know where to start. Is Python the way? And realistically will it help a lot for this simple web scraping type stuff? I've managed one way or another with basic VBA and endless web searches for code snippets.. But logging into sites etc can be a problem... – drb01 Jul 04 '21 at 08:24
  • @QHarr I tried it myself with Chrome and PhantomJS (with PJS actually only to try out screenshots of entire websites). I tried FF and Edge at the very beginning without success, but I hadn't yet found YasserKhalil's installation instructions. It works very well with Chrome, so I didn't explore the other browsers any further. – Zwenn Jul 04 '21 at 11:05
  • Look here for basic information on Python: https://docs.python.org/3/tutorial/ One of many other pages is this: https://www.learnpython.org/ And here is a page especially about web scraping with Python: https://realpython.com/beautiful-soup-web-scraper-python/ – Zwenn Jul 04 '21 at 11:05

2 Answers


The following code (no web drivers) works, though it isn't a one-liner. The information is embedded in the page body as a JSON string, which I isolate with the REGEX function below and then parse with a JSON parser.

Dim XMLPage As New MSXML2.XMLHTTP60 'requires reference: Microsoft XML, v6.0
Dim strPattern As String: strPattern = "window.__STATE__ = ({.+}}}});"
Dim JSON As Object
Dim Key As Variant
Dim key1 As String, key2 As String

XMLPage.Open "GET", "https://www.wsj.com/market-data/bonds/treasuries", False
XMLPage.send

Set JSON = JsonConverter.ParseJson(REGEX(XMLPage.responseText, strPattern, "$1"))

' Notes and Bonds
key1 = "mdc_treasury_{""treasury"":""NOTES_AND_BONDS""}"

For Each Key In JSON("data")(key1)("data")("data")("instruments")
    Debug.Print Key("maturityDate")
    Debug.Print Key("ask")
    Debug.Print Key("askYield")
    Debug.Print Key("bid")
    Debug.Print Key("change")
Next Key

' Bills
key2 = "mdc_treasury_{""treasury"":""BILLS""}"

For Each Key In JSON("data")(key2)("data")("data")("instruments")
    Debug.Print Key("maturityDate")
    Debug.Print Key("ask")
    Debug.Print Key("askYield")
    Debug.Print Key("bid")
    Debug.Print Key("change")
Next Key

The following function will need to be copied into a module:

Function REGEX(strInput As String, matchPattern As String, Optional ByVal outputPattern As String = "$0") As Variant
    Dim inputRegexObj As New VBScript_RegExp_55.RegExp, outputRegexObj As New VBScript_RegExp_55.RegExp, outReplaceRegexObj As New VBScript_RegExp_55.RegExp
    Dim inputMatches As Object, replaceMatches As Object, replaceMatch As Object
    Dim replaceNumber As Integer

    With inputRegexObj
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = matchPattern
    End With
    With outputRegexObj
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .Pattern = "\$(\d+)"
    End With
    With outReplaceRegexObj
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
    End With

    Set inputMatches = inputRegexObj.Execute(strInput)
    If inputMatches.Count = 0 Then
        REGEX = False
    Else
        Set replaceMatches = outputRegexObj.Execute(outputPattern)
        For Each replaceMatch In replaceMatches
            replaceNumber = replaceMatch.SubMatches(0)
            outReplaceRegexObj.Pattern = "\$" & replaceNumber

            If replaceNumber = 0 Then
                outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).value)
            Else
                If replaceNumber > inputMatches(0).SubMatches.Count Then
                    'REGEX = "A too-high $ tag found. Largest allowed is $" & inputMatches(0).SubMatches.Count & "."
                    REGEX = CVErr(xlErrValue)
                    Exit Function
                Else
                    outputPattern = outReplaceRegexObj.Replace(outputPattern, inputMatches(0).SubMatches(replaceNumber - 1))
                End If
            End If
        Next
        REGEX = outputPattern
    End If
End Function

The following resources will help:

How to use Regular Expressions (Regex) in Microsoft Excel both in-cell and loops

https://github.com/VBA-tools/VBA-JSON

You will need to install the JSON converter and set a reference to Microsoft VBScript Regular Expressions 5.5 in the VBE (Tools > References). The REGEX function was found elsewhere on Stack Overflow, so someone else deserves the credit for it.
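Once the JSON is parsed, the Debug.Print loops above can write to a worksheet instead of the Immediate window. A minimal sketch, assuming JSON and key1 are set as in the code above; the target sheet name "Treasuries" is my own placeholder:

```vba
' Sketch only: write the Notes & Bonds instruments to a sheet.
' Assumes JSON and key1 from the code above; "Treasuries" is a placeholder name.
Dim ws As Worksheet, inst As Variant, r As Long
Set ws = ThisWorkbook.Worksheets("Treasuries")
ws.Range("A1:E1").Value = Array("MATURITY", "ASKED", "ASKED YIELD", "BID", "CHG")
r = 2
For Each inst In JSON("data")(key1)("data")("data")("instruments")
    ws.Cells(r, 1).Value = inst("maturityDate")
    ws.Cells(r, 2).Value = inst("ask")
    ws.Cells(r, 3).Value = inst("askYield")
    ws.Cells(r, 4).Value = inst("bid")
    ws.Cells(r, 5).Value = inst("change")
    r = r + 1
Next inst
```

Writing cell by cell is fine at this size; for larger tables, building a 2-D Variant array and assigning it to a Range in one operation is faster.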


Similar idea to Christopher's answer in using regex. I grab the instruments data (a JavaScript array), split the component dictionaries out (minus the closing }), and then use regex, based on the headers, to grab the appropriate values.

I use a dictionary to map source keys to output headers, and I set a couple of request headers to signal a browser-based request and to mitigate being served cached results.

Ideally, one would use an html parser and grab the script tag, then use a json parser on the JavaScript object within the script tag.

If you want the data from the other tabbed results, I can add that in by explicitly setting re.Global = True, then looping the returned matches. Depends whether you want those and how you want them to appear in the sheet(s).

I currently write results out to a sheet called Treasury Notes & Bonds.


Option Explicit

Public Sub GetTradeData()
    Dim s As String, http As MSXML2.XMLHTTP60 'required reference: Microsoft XML, v6.0
    
    Set http = New MSXML2.XMLHTTP60

    With http
        .Open "GET", "https://www.wsj.com/market-data/bonds/treasuries", False
        .setRequestHeader "User-Agent", "Mozilla/5.0"
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        s = .responseText
    End With
    
    Dim re As VBScript_RegExp_55.RegExp 'required reference Microsoft VBScript Regular Expressions
    
    Set re = New VBScript_RegExp_55.RegExp
    re.Pattern = "instruments"":\[(.*?)\]"
    s = re.Execute(s)(0).SubMatches(0)
    
    Dim headers() As Variant, r As Long, c As Long, mappingDict As Scripting.Dictionary 'required reference Microsoft Scripting Runtime
    
    Set mappingDict = New Scripting.Dictionary
    mappingDict.Add "maturityDate", "MATURITY"
    mappingDict.Add "coupon", "COUPON"
    mappingDict.Add "bid", "BID"
    mappingDict.Add "ask", "ASKED"
    mappingDict.Add "change", "CHG"
    mappingDict.Add "askYield", "ASKED YIELD"
    
    headers = mappingDict.keys
    
    Dim results() As String, output() As Variant, key As Variant
    
    results = Split(s, "}")
    ReDim output(1 To UBound(results), 1 To UBound(headers) + 1)
    
    For r = LBound(results) To UBound(results) - 1
        c = 1
        For Each key In mappingDict.keys
            re.Pattern = "" & key & """:""(.*?)"""
            output(r + 1, c) = re.Execute(results(r))(0).SubMatches(0)
            c = c + 1
        Next
    Next
    
    re.Pattern = "timestamp"":""(.*?)"""
    re.Global = True
    
    With ThisWorkbook.Worksheets("Treasury Notes & Bonds")
        
        .UsedRange.ClearContents
        
         Dim matches As VBScript_RegExp_55.MatchCollection
         
         Set matches = re.Execute(http.responseText)
        .Cells(1, 1) = matches(matches.Count - 1).SubMatches(0)
        .Cells(2, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(3, 1).Resize(UBound(output, 1), UBound(output, 2)) = output
    End With
End Sub
QHarr
  • Both these solutions work fantastically. I am very grateful and would never have done something so elegant in a million years. I don’t need the Bills data, thanks. I now have to spend some time working out what is really going on! One more question: the website has a timestamp for the treasury data, currently “Friday, July 02, 2021”. Is there a clever way to get that too? Of course I could use InStr(), but again that seems “wrong”? – drb01 Jul 04 '21 at 08:22
  • I have updated to return the timestamp requested. Please note that the use of ClearContents towards the end will empty the sheet's usedrange between runs. – QHarr Jul 04 '21 at 08:39
  • I know you stated why headers were added to the request, but could you explain your reason again... dumbed down a little for someone like myself? – Christopher Weckesser Jul 04 '21 at 13:08
  • Hi. Just pre-emptive measures. The User-Agent to simulate a browser request to the server, and the other header to hopefully avoid being served cached (old) results rather than the latest. Usually more useful with frequently updating pages. @ChristopherWeckesser – QHarr Jul 04 '21 at 13:40
  • This is so much better than what I had been doing before – writing VBA to load IE11 and copy & paste into Excel – and I suspect much more reliable. I also scrape data from websites that require a login, e.g. www.ft.com. I have also done this with VBA/IE11 and it works, as I can log in using IE11 and the login is remembered by IE11 and therefore VBA. Is there a better way to do this avoiding IE11 (which is about to be deprecated anyway)? This problem is probably a whole different kettle of fish, in which case don't worry.. – drb01 Jul 04 '21 at 14:11
  • Depends on the website. Authentication (login) is possible with many sites using xmlhttp requests (which is the type of browserless request shown above). https://codingislove.com/http-requests-excel-vba/ – QHarr Jul 04 '21 at 14:39
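To make the comment above concrete: a browserless login usually means POSTing the login form's fields with XMLHTTP and relying on the request object to carry the session cookie on later GETs. This is purely a sketch; the URL and the field names username/password are placeholders you would take from the real site's login form or network traffic, and some sites need extra steps (CSRF tokens, ServerXMLHTTP, explicit cookie handling):

```vba
' Hypothetical login sketch: the URL and form-field names are placeholders.
Dim http As New MSXML2.XMLHTTP60 'requires reference: Microsoft XML, v6.0
Dim body As String
body = "username=me%40example.com&password=secret" 'URL-encoded form body
http.Open "POST", "https://example.com/login", False
http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
http.send body
Debug.Print http.Status 'check for 200/302 before issuing further requests
' Later GETs from the same process typically reuse the session cookie.
```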
  • I figured with the User-Agent. I've seen people rotate proxies and User-Agents to ensure their web scraping goes "uninterrupted". I had no idea about the second header though. Thanks for the info. @QHarr – Christopher Weckesser Jul 04 '21 at 16:08
  • @ChristopherWeckesser You've been researching and it shows. Well done. – QHarr Jul 04 '21 at 17:13
  • FWIW I added a bit of error handling to prevent runtime errors. The statement s = re.Execute(s)(0).SubMatches(0) seems prone to errors if no matches are found. So I broke it down: Dim Mats as Object: Set Mats = re.Execute(s) : if Mats.Count < 1 Then error_handling Else s=Mats(0).SubMatches(0) …. Needs to be done in a couple of places. Hope this is helpful for other people arriving at this thread – drb01 Jul 04 '21 at 18:55
  • Were you using this on another page? Can you provide a case where no matches found please? – QHarr Jul 04 '21 at 21:36
  • No, there wasn't a problem, your code is great. But this piece of code is part of a much bigger program that I leave running 24/7. So I just wanted to be 100% sure there were no runtime errors. E.g. sometimes websites aren't available, or update incorrectly, or change format without warning. I noticed if the .responseText wasn't as expected a runtime error would occur... – drb01 Jul 05 '21 at 10:31
  • Then you should test the status code returned, and also use If re.Test(...) Then ... Else Exit Sub etc., rather than a count of matches. – QHarr Jul 05 '21 at 16:56
  • Thanks, I have added re.test(). Actually I do find that usually .responseText is about 600kB. But occasionally, maybe 1 in 10 times, it is about 1MB, and then there are no matches and re.test() fails. I have no idea why this happens. – drb01 Jul 06 '21 at 12:07
  • I think it's because the response hasn't fully loaded. Not sure it will definitely fix it, but consider changing async to True, then adding a Do Until loop immediately after .send, waiting for .readyState to be 4. – QHarr Jul 06 '21 at 16:26
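QHarr's suggestion above would look roughly like this (a sketch of the idea, not a guaranteed fix): open the request with async = True, then spin on readyState until it reaches 4 (complete):

```vba
Dim http As New MSXML2.XMLHTTP60 'requires reference: Microsoft XML, v6.0
http.Open "GET", "https://www.wsj.com/market-data/bonds/treasuries", True 'async
http.setRequestHeader "User-Agent", "Mozilla/5.0"
http.send
Do Until http.readyState = 4 '4 = complete
    DoEvents 'keep Excel responsive while waiting
Loop
If http.Status = 200 Then Debug.Print Len(http.responseText)
```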
  • Unfortunately it hasn't fixed it. Frankly it's more of a curiosity than a problem - I can just loop until re.test(..) is true – drb01 Jul 06 '21 at 17:23
  • I think it is just that the site is returning errors for some reason. If I look in .responseText it is full of things like "errorMessage":"Unable to display bonds". Not a problem, thanks for all the help – drb01 Jul 06 '21 at 18:02