2

With dynamic web pages that display a table of retrieved data, I’ve found that both MSXML2.XMLHTTP and the Internet Explorer object usually can’t access this data. A good example is https://www.tiff.net/tiff/films.html. Both techniques won’t retrieve any of the movie data – just the surrounding web page. The code I’ve tried is as follows:

Function getHTTP(ByVal sReq As String) As Variant
    On Error GoTo onErr
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", sReq, False
        .send
        getHTTP = StrConv(.responseBody, 64)
    End With
    Exit Function
    onErr:   MsgBox "Error " & Err & ": " & Err.Description, 49, "Error opening site"
End Function

Function GetHTML(ByVal strURL As String) As Variant
  Dim oIE As InternetExplorer
  Dim hElm As IHTMLElement
  Set oIE = New InternetExplorer
  oIE.Navigate strURL
  Do While (oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE)
     DoEvents
  Loop
  Set hElm = oIE.Document.all.tags("html").Item(0)
  GetHTML = hElm.outerHTML
  Set oIE = Nothing
  Set hElm = Nothing
End Function

But there is a way to manually retrieve the movie data – just follow these steps with Microsoft Edge or Internet Explorer:

Right-click on one of the movies 

Choose “inspect element." The DevTools console opens. 

At the bottom-left of the screen, click on the “html” tab. 

Right-click the tab.  Choose “copy.” 

Open notepad and paste what you’ve copied.

You now have the movie data and can save it to a file for parsing. My question: Is there any way to get this data programmatically?

QHarr
  • 83,427
  • 12
  • 54
  • 101
trevbet
  • 145
  • 1
  • 12

2 Answers2

1

Here are the film titles using IE (you can use same process to get directors)

Option Explicit
Public Sub GetFilms()
    Dim IE As New InternetExplorer, html As HTMLDocument, films As Object, i As Long
    With IE
        .Visible = True
        .navigate "https://www.tiff.net/tiff/films.html"

        While .Busy Or .readyState < 4: DoEvents: Wend
        Set films = .document.querySelectorAll("[target=_self]")

        For i = 0 To films.Length - 1
            Debug.Print films.item(i).innerText
        Next
        .Quit '<== Remember to quit application
    End With
End Sub

XHR is too fast for this, with the URL provided, but IE is just fine.

If you inspect the HTML you can see each film has the following commonality:

HTML

There is an attribute within the a tag called target whose value is _self.

You can use an attribute CSS selector to gather all of these matching elements using the querySelectorAll method of document.


CSS selector (sample):

Sample


I would be interested in if this can be solved for getting the film descriptions by parsing the HTML. I had thought the presence of the comments was obscuring the film descriptions. A regex which selects the text within these in theory "<!-- react-text: \d+ -->([^...].+?(?=<))" seems to fail when applied to the .innerHTML as did attempts to swop out the comment start and finish with regex.

QHarr
  • 83,427
  • 12
  • 54
  • 101
  • Can you show me how to use your routine to get additional fields besides the title? I can't observe "films" in the locals window, If I click on it, the program freezes then crashes. You can see Santosh's variables window and mine in this thread … how do I get his? – trevbet Aug 07 '18 at 15:35
  • It is a bug with Excel that you can't expand a static nodeList in the locals window (or iterate with For Each), which is what querySelectorAll returns. See [here](https://stackoverflow.com/questions/50021133/excel-crashes-when-attempting-to-inspect-dispstaticnodelist) – QHarr Aug 07 '18 at 15:53
  • 1
    For the last couple of months `.querySelector()` has got a drastic popularity (in reality it should) when it is about using the same within vba. You are one of the contributors to deserve the credit @QHarr. – SIM Aug 07 '18 at 21:01
  • @SIM You introduced me to it! It is a shame there is so much more to CSS selectors than we can use in VBA :-( – QHarr Aug 07 '18 at 21:03
  • Who introduced it is of a little value. What is important is how you flourish it. This is a recent trend (if you consider other languages as well) to use css selector even over xpath only because of it's specificity, robustness and the state of being less prone to break. – SIM Aug 07 '18 at 21:12
  • And yet CSS selectors can do so much more than what is allowed with querySelector. I find that a little disappointing. I'm switching more over to python so will see if that serves wider CSS selector syntax. – QHarr Aug 07 '18 at 21:14
  • 1
    Oh yeah!! That's because of pseudo-selectors. If we had scopes to use that then vba would have reached in a new level when it comes to do the scraping thing. – SIM Aug 07 '18 at 21:18
  • Qharr thanks for pointing me to that article with the code and examples. It looks powerful, but I've been at it all afternoon and can't get it to work with the pages I'm playing with. Could you please show me what in the code has to be changed to get something useful off of this page? https://www.tiff.net/tiff/a-star-is-born/ Say the director and the running time? – trevbet Aug 07 '18 at 23:28
1

Why Json? Because the page is loaded using json data

To View: Use Google Chrome --> Press F12 --> Load URL --> Goto Network tab

enter image description here


Code:

Sub getHTTP()

    Dim Url As String, data As String
    Dim xml As Object, JSON As Object, colObj, item


    Url = "https://www.tiff.net/data/films-events-2018.json?q=1513263947586"

    Set xml = CreateObject("MSXML2.ServerXMLHTTP")
    With xml
        .Open "GET", Url, False
        .send
        data = .responseText
    End With


    Set JSON = JsonConverter.ParseJson(data)
    Set colObj = JSON("items")

    For Each item In colObj
        Debug.Print item("title")
        Debug.Print item("description")

        For Each c1 In item("cast")
            Debug.Print c1
        Next

        For Each c2 In item("countries")
            Debug.Print c2
        Next
    Next
End Sub

Output

enter image description here


Installation of JsonConverter

  1. Download the latest release
  2. Import JsonConverter.bas into your project (Open VBA Editor, Alt + F11; File > Import File) Add Dictionary reference/class
  3. For Windows-only, include a reference to "Microsoft Scripting Runtime"
  4. For Windows and Mac, include VBA-Dictionary

Tree View of Data

enter image description here

Santosh
  • 12,175
  • 4
  • 41
  • 72
  • 1
    Thanks you for all of this excellent advice -- especially JSON. I've begun playing with it and like what I see. I just have a few questions: 1. Where did you come up with "https://www.tiff.net/data/films-events-2018.json?q=1513263947586"? 2. Why are only the title and description available in the colobj? Things like "cast" and "countries" should be available too, but the program gives you an error if you try to use anything other than the two you chose. 3. Where can I see the title and description values in the locals window? – trevbet Aug 06 '18 at 18:30
  • Glad you liked it. 1. Http watch window shows site is loaded using json data 2. Because they are collection and so needs to be looped (see tree view) – Santosh Aug 06 '18 at 18:56
  • 1
    Great answer + 1 – QHarr Aug 07 '18 at 09:21
  • 1
    OK, I'm there! I finally discovered the CQZpR.jpg page. Could you (or anyone) please confirm that the data on the current record is, in fact, not visible anywhere in the locals window? That routine of yours needed to have C1 and C2 declared as variants. Even then, I couldn't assign a retrieved value … only print it. I had to take "server" out of MSXML2.ServerXMLHTTP, then it worked. So now it looks like I'm in business! Thanks for your first-rate assistance. This is one of the most informative responses I've ever received in a technical forum, and that's saying something! – trevbet Aug 07 '18 at 17:03
  • 1. Yes, C1 and C2 should be declared as variants 2. Data is visible in local window – Santosh Aug 07 '18 at 17:12
  • Are there some JSON data blocks that you can't make a column object out of? I'm trying https://www.tiff.net/data/films/accidence.json?q=1533655820587. jsonstring contains what looks like a similar block of data, but I cannot do a set colObj off of it using either "items" or "pitch". Also, how did you generate your tree view of JSON? – trevbet Aug 07 '18 at 20:00
  • I figured out what the problem was … since the JSON block was only one collection rather than a series of them, I just had to leave the JSON variable alone, rather than assign it to a column object and step through it (i.e. JSON("title") rather than JSON.item("title"). I'd REALY like to know where you got that JSON tree diagram. – trevbet Aug 09 '18 at 00:05
  • One other thing: that "q" parameter for JSON URL's … I have to step through all of the movies on that menu page to collect information, and all of them have a different parameter. Is there any way to find it them out programmatically rather than using the devtools in Google Chrome one at a time? None of the parameters are present on either the menu page or the individual pages. How does the SERVER know to load the JSON file for an individual movie? – trevbet Aug 09 '18 at 00:08