Unable to get the content of the document using XMLHTTP request (Part 2)

Question

This is a follow-up to my previous question. With QHarr's help I was able to retrieve the content of the website by using `.setRequestHeader "Cookie", "juLD4H3B=ABZHajF6AQAAH0KEfNV9kI1EEZg8m3BcrjBrBRN1ddwumUMKZVGciT2p_7ji"`, but this only worked for a day, as I believe the cookie has since expired.

I eventually found out that another request is made to the website with additional request headers. If that request is sent successfully, its response headers contain the cookie value.

I managed to figure out most of the required request headers, as they are easily found in the first response:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9    
Accept-Language: en-GB,en;q=0.9
True-Client-IP: 165.225.112.130
Upgrade-Insecure-Requests: 1
X-Cloud-Trace-Context: cfcc69068c5cb2d847890a7547b3e941/1772772094880168808
X-EC-Hot-Hash: 7790000207959645976
x-ec-pop: sgb
X-EC-Session-ID: 88079078809787886379151172106634033866
X-EC-Uuid: 1570108802375324103115733450970686183758
X-Forwarded-For: 103.252.200.88, 165.225.112.130, 152.195.199.174, 34.102.254.51
X-Forwarded-Proto: https
X-Host: www.businesstimes.com.sg
fToAPHTNF0-f: AwvHZFF6AQAAy-A_IruEaP1KJTiiaipDPoplNAurzgyEgKa0yDReQsaYWX4hAaXhcIKucsP1wH8AAEB3AAAAAA==

What I can't figure out are these request headers:

fToAPHTNF0-a: FcpvG3-0vr3aA8Wo3_e0pX7pDZl24EiY8Z_p81aALmAGp_UbCYMqQFZJC_EVsQByFUoAWUXFHtv2tPyBGEBpX6XDGGvxMW2otawK-FTcSV84AFh_9q_hA7AT7EPMYMzRay8xkbRZT5g0q8T9YQJMRH5S14aPsLHbP5Qdhb7xVNR0gTL9LE_WWDzsyHyNz3Nc9oKm0pgbcM3yGA7g7U-sCcrvNSa7ITbrO2Z62mEbf6XShFUIJcPY63Kq7FyDpz1rB2L4ItGrZA3Tkfz5e5DwoIK6MIh-y4e5ob5qYtBDhkfV7uBbI-TuvLpe8HC6FjSxdP_hlEPxfJvkMf8sXSgrTaXXBwwRVBx5Yq3eBljwCjgNiLbVi6lesZVE3S0aj2Q3fDLTbyG79jys1awsPZ8jIPs9W0YSHUrKhi73umkOs3itvJkqnaw1Uf75IpTLnJ_n_ZGSp2u9pRZJBQUx2qZhhYm4tV6qnV8mkVUmg2D9FbECOH4RboTW9ON8A8lyvjoheZ5RuH-quwlGgXXqISTucrnGK2Tz7pqAC49yMH8qqc7EV7BHhjRhVp-eZFe6F7c72DrtXjjcm5fpLK-1F0MG08hZFbzthjrHTN8KvR2FcQ47rSF91izAQMGZ4rzIjGCuqPuZkdIjPLjq9tUA9KRkOs5YxSt6RalUqIGouBsYvcUJaHGJSJhzPowSVTs8mMUbY9wBZAB5G7Yn08JUHy4ZGf-Y-Fvnl0lcJr9v7yxmZSQSttEFqAT_prC3zoqzdeUuDOVWLqyUiC_oJKOA7_mcJzlMX8nnj--Iuq2Pij83rtbNDSvrXXCKi5UOCjrrV04XlFabt48MWPF0t8vrwHpM7_tE56P7IW3ZCYRPPpRHmMeJ72MwQooGtJnCJXq2Cq0itAB1GnodvyYpAhqtEzma49TB6NRSNN4U4JGiz787uaJg1pdavdOzdejbS1gh_7SDwxHo4JMhhOpEWKgCdzfTziYF0BeKshkSRJj3ejUq5cqEDg_MnqeEaWM_VBiYRtqXGK7nDNtDKPW1CV3NfX11kV9BeAXNakcJhYSh5Qk-kks0HBEmCU7uU4U8bvOThdIurVGFoDcPxZywmC3cwF0Kk_SM2dR3nuN1nMObGopLnGGIEzRh9uaIHFowYuSUYuuy0EdUjgYShYMhLSZLRCzf7dOFHndPOV-RXhG446hMDAGzLM6PIPBP18ugx4fE36l3wPvGK77Ki5eVjB8fK9l2wK1f820xUbCElL15cJNkfiQ9uicTW-QR5knEw5LEmHU92HePFUJh8qQmYAWmv9gU8eDrIJaoDlFDsgStH-erlNpiDcOxSCRVFBBq-gHcJaImucwSbvnxvvAmAGebThueOEzZAupc0P21W1Q2WijGPf6n2zqkG9BIhYEk0BhYm_1Jl2FlEOz1_EHRVHjoBycnXMFlHet6Wh_4MauDiKkM4FEehYDr-rSkyZUmRBphuIq
fToAPHTNF0-b: iyrw7f
fToAPHTNF0-c: AMDFYVF6AQAAbtw8T-EjslRuCNO9KkreSk7faXdYDWrgCCNd_bD_S_Jdp51-
fToAPHTNF0-d: AAaChAiBBKCMgUGASZAQgICQACKw_0vyXaedfv_____sbgLzAYpha0zTSuaEBn0oG8gz2gI    
fToAPHTNF0-z: q

For completeness, this link is the HTML document returned from the first response in the above sample.

My suspicion is that these values are generated within the minified script, and that there's no way for me to get the cookie without using a browser.

I appreciate all the help for this!

score 2 · Accepted Answer · edited Jul 03 '21 at 16:39

I tried using "POST" instead of "GET" and it worked for me. Here's a little bit of code that gets the headline of each article. I didn't bother parsing the rest of the information that you might want.

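' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
' The procedure name is arbitrary.
Sub PrintArticleTitles()
    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    Dim article As Object

    ' Synchronous request using POST instead of GET, as described above.
    XMLPage.Open "POST", "https://www.businesstimes.com.sg/keywords/singapore-parliament", False
    XMLPage.send

    ' Load the response into an HTML document so it can be queried.
    HTMLDoc.body.innerHTML = XMLPage.responseText

    ' Print the headline of each article.
    For Each article In HTMLDoc.getElementsByClassName("widget__title")
        Debug.Print article.innerText
    Next article
End Sub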

If you need to include a cookie, I believe you can use the following code (placed between XMLPage.Open and XMLPage.Send). You will need to adjust the expiry date.

XMLPage.setRequestHeader "Cookie", "NSC_JOlo3vprczwsrc0em1nifnbukr3oebt=ffffffff09a3792945525d5f4f58455e445a4a423660; Path=/; Secure; HttpOnly; Expires=Sat, 03 Jul 2021 02:42:31 GMT;"

However, I didn't need to include this to get the HTMLDoc.
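A side note on the cookie line above: as far as I know, a `Cookie` request header normally carries only `name=value` pairs, while `Path`, `Secure`, `HttpOnly` and `Expires` are attributes of the `Set-Cookie` response header, so they are probably unnecessary in the request. A minimal sketch of trimming such a string down to its first pair follows. `CookiePairOnly` is a hypothetical helper name, not part of MSXML.

```vba
' Hypothetical helper: keeps only the leading name=value pair of a
' Set-Cookie style string and drops attributes such as Path or Expires.
Function CookiePairOnly(ByVal setCookieValue As String) As String
    Dim semicolonPos As Long

    ' The first semicolon separates the name=value pair from the attributes.
    semicolonPos = InStr(1, setCookieValue, ";")

    If semicolonPos > 0 Then
        CookiePairOnly = Trim$(Left$(setCookieValue, semicolonPos - 1))
    Else
        ' No attributes present, so the whole string is the pair.
        CookiePairOnly = Trim$(setCookieValue)
    End If
End Function
```

It could then be used as `XMLPage.setRequestHeader "Cookie", CookiePairOnly(rawCookie)`, where `rawCookie` is assumed to hold the full string copied from the browser.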

This works! Before I accept the answer, do you have any insight into whether it's possible to retrieve the additional articles that are loaded by clicking the "More Stories" button? The href has the parameters `?created=-24%20months&page=1`, but simply adding these parameters to the request doesn't seem to work. – Raymond Wu, Jul 03 '21 at 04:24

score 2 · Answer 2 · answered Jul 03 '21 at 15:34

I'm not sure you can do it with the method I proposed above. Maybe you can figure out the series of requests that will get you more articles but I haven't noticed any simple solutions using this method.

I would recommend using the Selenium web driver to interact with the pages. I find that using an IE object can be hit or miss so I prefer Selenium. It's a little slower and will require some setup (download Selenium, replace the driver with one that matches the web browser version you have, enable Selenium type library in references). The following link can help you get started:

Using Google Chrome in Selenium VBA (Installation Steps)

As for using it to click the button, I've written code that does exactly that. It keeps clicking the "Load More" button until there isn't anything else to load. See the following link for more information:

How to click a webpage button in VBA for parsing

Selenium is nice because you can find elements by their XPath, another method to aid you in selecting buttons.

answered Jul 03 '21 at 15:34

Christopher Weckesser

Thank you! I can do this task using IE, but Selenium is a 100% no-go zone due to IT policy. I have been trying with a series of requests but, as I mentioned in my question, the missing headers are the only piece needed to complete this code. – Raymond Wu Jul 03 '21 at 15:42

Oh you're right. I missed the parameters you found in the href. Substituting `https://www.businesstimes.com.sg/keywords/singapore-parliament?created=-24%20months&page=3` as the URL into the code for the first answer worked for me. Also, feel free to remove the SaveHTMFile function. I forgot it was in there... – Christopher Weckesser Jul 03 '21 at 15:59
It doesn't seem to work for me. `Debug.Print HTMLDoc.getElementsByClassName("widget__title").Length` gave me 21 for both calls, with or without the parameters, so it doesn't seem like it's working. – Raymond Wu Jul 03 '21 at 16:06

Yeah you'll get the same number of articles but look at the titles and dates. The first article appears to be the same regardless of the page number, but I believe the remaining articles are older news. So you might have to keep making requests for different page numbers instead of getting them all in one go. – Christopher Weckesser Jul 03 '21 at 16:10
This looks promising and I should have no problem finishing this code using this approach. I appreciate all the help you have given me! It bugs me how simple this solution is and that I did not catch it, since I didn't spot anything in Chrome DevTools. – Raymond Wu Jul 03 '21 at 16:49

Simple solutions are the best! I just discovered https://web.postman.co/ the other day and I find it helpful for quickly testing which headers are mandatory and the effects of passing various parameters with my requests. Might be a little easier than debugging/post-processing HTML files in VBA. – Christopher Weckesser Jul 03 '21 at 17:00
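
The page-by-page approach discussed in the comments above could be sketched roughly as follows. This is only an illustration built on the accepted answer's code. The procedure name, the constant names, and the upper bound of 3 pages are assumptions, and since the first article reportedly repeats on every page, duplicates would still need to be filtered out.

```vba
' Sketch: request several result pages in turn and print the titles on each.
' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
Sub PrintTitlesForPages()
    Const BASE_URL As String = "https://www.businesstimes.com.sg/keywords/singapore-parliament"
    Const LAST_PAGE As Long = 3   ' assumption: arbitrary upper bound for illustration

    Dim XMLPage As MSXML2.XMLHTTP60
    Dim HTMLDoc As MSHTML.HTMLDocument
    Dim article As Object
    Dim pageNumber As Long

    For pageNumber = 1 To LAST_PAGE
        ' Use a fresh request object for every page.
        Set XMLPage = New MSXML2.XMLHTTP60
        XMLPage.Open "POST", BASE_URL & "?created=-24%20months&page=" & pageNumber, False
        XMLPage.send

        If XMLPage.Status = 200 Then
            Set HTMLDoc = New MSHTML.HTMLDocument
            HTMLDoc.body.innerHTML = XMLPage.responseText

            Debug.Print "--- page " & pageNumber & " ---"
            For Each article In HTMLDoc.getElementsByClassName("widget__title")
                Debug.Print article.innerText
            Next article
        Else
            ' Stop at the first page that does not come back successfully.
            Debug.Print "Request for page " & pageNumber & " returned status " & XMLPage.Status
            Exit For
        End If
    Next pageNumber
End Sub
```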
