XMLHTTP is pulling HTML from parent webpage

Question

I am trying to pull historical weather data from wunderground- originally I tried to scrape the web data off the main webpage using the standard HTML elements etc. etc.. however then I discovered that if you add "?format=1" to the URL a browser will display more or less a csv which is much easier to parse into data tables...

However I was still using IE browser which can be quite slow if you have a lot of data to pull so I started looking into using an XMLHTTP object to help pull the data faster which is where I ran into my issue- as I mentioned adding "?format=1" to the end of the URL would return a nice slim csv, but when I use the XMLHTTP it pulls the HTML of the original page not the CSV!

It seems to me like when the XMLHTTP sends its request, the response comes back from the 'parent' site if you will and not the csv that I specified... I noticed that it pulls the HTML from the URL that ends in .html, like it cuts off at .html and that's why the response is from the .html page and not the CSV page.. so I tried looking into URL protocols to see if there was a way to force the XMLHTTP to request the csv URL but I am pretty far out of my element and couldn't find anything..

Code below- appreciate any help you can give me:

Public Sub downloadWebDataToCSV(URL As String)

    Dim ADOStream As ADODB.Stream
    Dim XMLHTTP As MSXML2.XMLHTTP60

    Set XMLHTTP = New MSXML2.XMLHTTP60
    XMLHTTP.Open "GET", URL, False
    XMLHTTP.send

    If XMLHTTP.Status = 200 Then
        Set ADOStream = New ADODB.Stream
        With ADOStream
            .Open
            .Type = adTypeBinary
            .Write XMLHTTP.responseBody
            .SaveToFile "C:\datafile.csv", 2
            .Close
        End With
    End If

End Sub

Here are the parent and csv URLs as well if that helps-

Mainpage: https://www.wunderground.com/history/airport/KDAL/2003/10/15/DailyHistory.html CSV: https://www.wunderground.com/history/airport/KDAL/2003/10/15/DailyHistory.html?format=1

Thanks in advance, TheSilkCode

score 1 · Answer 1 · answered Feb 06 '17 at 17:49

I think you need to change the encoding of what is returned, then you can write directly to a CSV after cleaning up the format a bit. To change the format, you need to set a few request headers. See below.

Public Sub downloadWebDataToCSV()
    Dim URL As String: URL = "https://www.wunderground.com/history/airport/KDAL/2003/10/15/DailyHistory.html?format=1"
    Dim XMLHTTP As New MSXML2.XMLHTTP60

    With XMLHTTP
        .Open "GET", URL, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
        .setRequestHeader "content-type", "text/html; charset=UTF-8"
        .send
    End With

    If XMLHTTP.Status = 200 Then
        Dim fso As Object: Set fso = CreateObject("Scripting.FileSystemObject")
        Dim Fileout As Object: Set Fileout = fso.CreateTextFile("C:\users\megatron\desktop\vba.txt")
        Dim myText As String: myText = Replace(XMLHTTP.responseText, "<br />", vbCrLf)
        Fileout.Write myText
        Fileout.Close
    End If

End Sub

my man! Worked great- thanks for your help. Still unclear to me exactly why those headers solved my problem, do you know any good documentation on using headers? I've seen explanations regarding individual solutions but would like to understand the concept of what your sending and how that effects what you get back... Much appreciated — TheSilkCode, Mar 15 '17 at 10:02
I compared google chrome's headers and Internet Explorer's. I noticed `"content-type", "text/html; charset=UTF-8"` was different with chrome so I used that. It makes sense, you are asking for the result to be returned as Text. — Ryan Wildry, Mar 15 '17 at 15:44
Here's a link about headers: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields — Ryan Wildry, Mar 15 '17 at 15:44

score 0 · Answer 2 · edited May 23 '17 at 12:17

You can get the data via Weather API by the example URL:

http://api.wunderground.com/api/c991975b7f4186c0/history_20031015/q/airport/KDAL.json

As you can see, the URL consists of three variable parts: API key c991975b7f4186c0, date 20031015 and location airport/KDAL. If you make XHR the response returns in JSON format.

How to obtain API key and parse JSON respons you can see in this answer.

XMLHTTP is pulling HTML from parent webpage

2 Answers2