
I've used this simple function for loading HTML documents from the web for some time now with no problems:

Function GetSource(sURL As String) As Variant

' Purpose:   To obtain the HTML text of a web page
' Receives:  The URL of the web page
' Returns:   The HTML text of the web page in a variant

    Dim oXHTTP As Object

    Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
    oXHTTP.Open "GET", sURL, False
    oXHTTP.send
    GetSource = oXHTTP.responseText
    Set oXHTTP = Nothing

End Function
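For reference, a minimal call site for the function above might look like this (the test sub and the URL are just placeholders, not part of my actual code):

```vba
Sub TestGetSource()
    Dim sHTML As Variant
    ' Hypothetical example URL; substitute any page you want to fetch
    sHTML = GetSource("http://example.com/")
    ' Show the first 200 characters in the Immediate window
    Debug.Print Left$(CStr(sHTML), 200)
End Sub
```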

but I've run into a situation where it loads only part of a page most of the time (not always -- sometimes it loads all of the expected HTML code). If you save the HTML of the page from a browser and post that saved copy elsewhere on the web, the function will always read the copy in full with no problem.

I'm guessing that the issue is timing -- that the dynamic page reports itself "done" while a script is still filling in details. Sometimes the script completes in time, other times it doesn't.

Has anyone ever encountered this behavior before and surmounted it? It seems that there should be a way of capturing, via the MSXML2.XMLHTTP object, exactly what you'd get if you went to the page and chose the save-to-HTML option.

If you'd like to see the behavior for yourself, here's a sample of a page that doesn't load consistently:

http://www.tiff.net/festivals/thefestival/programmes/specialpresentations/mr-turner

and here's a saved HTML file of that same page:

http://tofilmfest.ca/2014/film/fest/Mr_Turner.htm

Is there any known workaround for this?
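In case it matters, here is a sketch of a variant that also checks the request's HTTP status before trusting the body, so at least a server-side failure can be ruled out (the function name `GetSourceChecked` and the error string are my own additions, and 200 is the usual success case):

```vba
Function GetSourceChecked(sURL As String) As Variant
' Sketch: same synchronous GET, but verifying the HTTP status
' before returning the response body
    Dim oXHTTP As Object

    Set oXHTTP = CreateObject("MSXML2.XMLHTTP")
    oXHTTP.Open "GET", sURL, False
    oXHTTP.send
    If oXHTTP.Status = 200 Then
        GetSourceChecked = oXHTTP.responseText
    Else
        ' Surface the failure instead of returning a partial page
        GetSourceChecked = "HTTP error " & oXHTTP.Status & ": " & oXHTTP.statusText
    End If
    Set oXHTTP = Nothing
End Function
```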

trevbet
  • You talk of a *dynamic page*, but XMLHTTP will simply download whatever the server returns, unlike a browser, which will execute any content-modifying script etc. -- could this be what you're seeing? You should also examine the request's `.status` to ensure it executed as expected. What parts of the page are missing? Is it truncated? – Alex K. Jul 24 '14 at 10:59
  • I thought that if the varAsync parameter of the Open call was set to False, the request would execute until all data had been downloaded (i.e. the status is "complete"). The sample page is one of many similar pages, and what's missing is the unique content of each page. I can post a file of what should be captured and what the MSXML2.XMLHTTP object returns if you'd like to see the exact difference. – trevbet Jul 24 '14 at 13:13
  • *the request would execute until all data has been downloaded* -- yes, that's correct. Are you aware that the page is fetching data from an Ajax call that won't be downloaded automatically by XMLHTTP? Look in your browser's Network Monitor and see if that's what's missing. – Alex K. Jul 24 '14 at 13:17
  • This stuff: http://tiffdailydiscovery.com/rest?appid=930634242d0e16d0bce8df1f4913e318&type=json&method=getSocialFeed&data=dHlwZT1vZmZpY2lhbA== – Alex K. Jul 24 '14 at 13:18

2 Answers


I found a workaround that gives me what I want. I control Internet Explorer programmatically and invoke a three-second delay after I tell it to navigate to a page, to let the content finish loading. Then I extract the HTML code via an IHTMLElement from Microsoft's HTML library. It's not pretty, but it retrieves all of the HTML code for every page I've tried it with. If anybody has a better way of accomplishing the same end, feel free to show off.

Function testbrowser() As Variant
' Requires references to "Microsoft Internet Controls" (InternetExplorer)
' and "Microsoft HTML Object Library" (IHTMLElement)
   Dim oIE As InternetExplorer
   Dim hElm As IHTMLElement
   Set oIE = New InternetExplorer
   oIE.Height = 600
   oIE.Width = 800
   oIE.Visible = True
   oIE.Navigate "http://www.tiff.net/festivals/thefestival/programmes/galapresentations/the-riot-club"
   Call delay(3)   ' crude pause to let the page's scripts finish
   Set hElm = oIE.Document.all.tags("html").Item(0)
   testbrowser = hElm.outerHTML
End Function

Sub delay(ByVal secs As Integer)
   Dim datLimit As Date
   datLimit = DateAdd("s", secs, Now())
   While Now() < datLimit
      DoEvents   ' keep the host application responsive while waiting
   Wend
End Sub
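If a fixed pause is kept, a sleep call that yields the CPU is kinder than spinning on the clock. A sketch using the Windows `Sleep` API (the name `delaySleep` is mine, to keep it distinct from the `delay` Sub above; the Declare lines must go at the top of the module):

```vba
' Place at module level; PtrSafe is required on 64-bit Office
#If VBA7 Then
    Private Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#Else
    Private Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
#End If

Sub delaySleep(ByVal secs As Integer)
    Dim i As Integer
    ' Sleep in 100 ms slices, pumping messages so IE can keep loading
    For i = 1 To secs * 10
        Sleep 100    ' yield the CPU for 100 ms
        DoEvents     ' let the host process pending events
    Next i
End Sub
```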
trevbet
  • You can wait for IE to complete its stuff without using a fixed delay and pegging the CPU; http://stackoverflow.com/questions/19334880/ie-busy-not-working-well-vba – Alex K. Jul 25 '14 at 10:53

Following Alex's suggestion, here's how to do it without a brute force fixed delay:

Function GetHTML(ByVal strURL As String) As Variant
  Dim oIE As InternetExplorer
  Dim hElm As IHTMLElement
  Set oIE = New InternetExplorer
  oIE.Navigate strURL
  Do While (oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE)
     DoEvents
  Loop
  Set hElm = oIE.Document.all.tags("html").Item(0)
  GetHTML = hElm.outerHTML
  oIE.Quit   ' close the IE instance rather than leak it
  Set oIE = Nothing
  Set hElm = Nothing
End Function
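One caveat with the wait loop above: if IE never reaches READYSTATE_COMPLETE, it spins forever. A sketch of the same idea with a timeout guard (the function name and the 30-second cap passed by the caller are my own additions, not part of the accepted approach):

```vba
Function GetHTMLWithTimeout(ByVal strURL As String, ByVal maxSecs As Integer) As Variant
' Sketch: same approach as GetHTML, but bails out if the page
' never finishes loading within maxSecs seconds
  Dim oIE As InternetExplorer
  Dim hElm As IHTMLElement
  Dim datLimit As Date
  Set oIE = New InternetExplorer
  oIE.Navigate strURL
  datLimit = DateAdd("s", maxSecs, Now())
  Do While (oIE.Busy Or oIE.ReadyState <> READYSTATE_COMPLETE)
     If Now() > datLimit Then Exit Do   ' give up rather than hang
     DoEvents
  Loop
  Set hElm = oIE.Document.all.tags("html").Item(0)
  GetHTMLWithTimeout = hElm.outerHTML
  oIE.Quit
  Set oIE = Nothing
End Function
```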
trevbet