I am attempting to scrape a website that requires a login and whose core data is rendered with JavaScript and XHR requests. I am using the requests-html library, but the render() function appears to have no effect on the page. Here is my code:
import requests_html as requests
import bs4 as bs

# variables...
# def createForm()...

with requests.HTMLSession() as session:
    print("retrieving page...")
    initial_response = session.get(login_url)

    # Log in with the form data built from the login page
    print("logging in...")
    response = session.post(url=login_url, data=createForm(initial_response))

    # Fetch the target page and keep a copy of its HTML before rendering
    page_html = session.get(target_url)
    page = bs.BeautifulSoup(page_html.content, 'lxml')
    html_before = page.prettify()

    # Render the JavaScript, then parse the page again for comparison
    print('rendering...')
    page_html.html.render(sleep=5)
    page_rendered = bs.BeautifulSoup(page_html.content, 'lxml')
    html_after = page_rendered.prettify()

    if html_before == html_after:
        print("they are the same")
This is the HTML returned (the important parts):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Home | Compass
</title>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<script src="/cdn-cgi/apps/head/nXBUbHOMoxcWCnUqQqrCuyGGJ4s.js">
</script>
Boring CSS...
<meta content="text/html;charset=utf-8" http-equiv="Content-type"/>
</head>
<body class="greyBody">
Dull JSON...
Compass.assemblyVersion = "11.44.1.0";Compass.isDev = false;Compass.organisationUserId = 1921;Compass.organisationUserSussiId = "SAGE.ALLEN";Compass.organisationUserBaseRole = 1;Compass.organisationUserRoles = { "AfterHoursAccess": true, "MyFilesBase": true, "StaffStudentsMisc": true, "StudentsMisc": true};Compass.schoolId = "shenton.wa.edu.au";Compass.schoolName = "Shenton College";Compass.schoolPrimaryFqdn = "shenton-wa.compass.education";Compass.headAncestorId = "shenton.wa.edu.au";Compass.hasChildOrganisations = false;Compass.isInHierarchy = false;Compass.isTargetingAncestry = false;
</script>
<a href="/Communicate/Documentation/Help.aspx" style="position: absolute; left: -999px">
Help
</a>
<form action="./" id="aspnetForm" method="post">
<div class="aspNetHidden">
<input id="__EVENTTARGET" name="__EVENTTARGET" type="hidden" value=""/>
<input id="__EVENTARGUMENT" name="__EVENTARGUMENT" type="hidden" value=""/>
<input id="__VIEWSTATE" name="__VIEWSTATE" type="hidden" value="cUphbVG2sD46yFu7rFLU15w0eiJn+7KXkA6I6Cg/7RQ9m3rwlz5poc6KdcuOMApzHcafUPq70DbpviYl6V7vDYHgLMx23YF8OtMtdcmxVSk="/>
</div>
<script type="text/javascript">
//<![CDATA[
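var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}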
//]]>
</script>
<script src="/WebResource.axd?d=pynGkmcFUV13He1Qd6_TZMBp8pi1aG3kj_Rrf_NckYpQU5qPM8p1FZ-Rik-uln5rcqPDnR_gxYalKXvDaBNyhg2&t=636165368714134089" type="text/javascript">
</script>
<script src="/ScriptResource.axd?d=NJmAwtEo3Ipnlaxl6CMhvk3jMxAVfdhwj8EfOKm3TxozcZHxkgtaPL9w9WaPcaq30sskp_Glm4jiP922KJP1an86NqAQUdSFO5rhKIKoAuO5v3uoNlAezbrUkCluOH1LV_F9OB_HI13vUK6I2eQlLQ80jzjIESOQbg5oZuzg3A01&t=ffffffffd416f7fc" type="text/javascript">
</script>
<script src="/ScriptResource.axd?d=dwY9oWetJoJoVpgL6Zq8OJy60eKvb9zs3HNOFEuh2HK-a1JlTWrINdUt4GmfnVpd-vC-hGQfNOA-_hpGAIQxLJ6TRvLcoTQZ7vzC5ouXwZ7EB1Rqgo_p4dWNsoX1AAW-I0gKht_6IBwAHOTP4LV38H7v4PjwKJBs7h2NgozR47s1&t=ffffffffd416f7fc" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/4ed8095_javascript-resource-manager.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/62cdce8_utility.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/7c66c7e_ravenjs-loader.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/5de6c0f_ext-all.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ce7ba4b_jquery-1.8.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/81a11e3_autosuggest-widget.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ef94bb5_jquery-json-2.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/5fee56b_jquery.elastic.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/b9aa653_jquery.simplemodal.1.4.3.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/moment/cdeefcf_moment-and-data.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ext-js4.2.2/resources/js/8f9f704_ext-extensions-and-theme.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/d3ac6df_impersonate-widget.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Common/Scripts/0d786b6_compass.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/bb8b963_request-capture.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/System/Scripts/17975e6_external-resource-monitor.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/ckeditor/0d3caaa_ckeditor.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/Calendar/Scripts/625070f_calendar-and-extensions.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/PageScripts/583fb06_HomePage.Chronicle.min.js" type="text/javascript">
</script>
<script src="https://assets.compass.education/StaticAssetsK/PageScripts/9571a67_HomePage.min.js" type="text/javascript">
</script>
<script type="text/javascript">
//<![CDATA[
Sys.WebForms.PageRequestManager._initialize('ctl00$ctl04', 'aspnetForm', [], [], [], 90, 'ctl00');
//]]>
</script>
<script type="text/javascript">
});ersonateWindow.show();xt.create('Compass.widgets.ImpersonateWidget',
{});, function(e) {ggestions?sessionstate=readonly",
Unremarkable HTML...
Encrypted:6SeqWGbwjzN6ZfMnVVAU1sXLmHGC06o6K+A6lAhEBFCLQgeQq6ZU810mqSzy0zNyMwUhKnrAlYfvvlTuy5xpIj4OkW4pGBLFN6PVai3RoevYQkgbvy9vqVBanzrNVfRGsMIE8kgq+8pJGtNiCveqQAvzLfhgHhm5QQ8/k4ShskzjZdRPX9MUNpa-->kHWHQOxCM73dFIgYrWM6PexC+wA31RdtyPTEp7gRCb7ulIlQFSKresH2xPmdHNeLhA7mCefNrbBDMG7eJ5kqhLsh3QqbxMQ1IABdA42nGGSdw1GFkmRJYS06mNS4Cjp44cmQBt
<script src="https://assets.compass.education/StaticAssetsK/Scripts/Lib/e0c3e6b_LazyLoad.min.js" type="text/javascript">
</script>
<script type="text/javascript">
</script>
<div class="aspNetHidden">
<input id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" type="hidden" value="CA0B0334"/>
</div>
</form>
</body>
</html>
I have not managed to decipher all of the scripts, as I am not experienced in JavaScript, although they appear to be fetching the data. Any explanation of why these scripts are not running, or any alternative solution that is adequately fast, would be appreciated.
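
One thing that may be worth checking is which attribute the comparison reads. The sketch below assumes that render() updates the `html` attribute of the response in place and leaves the raw `content` bytes as they were downloaded, so it compares `html.html` before and after rendering. The helper name `render_changed_markup` is hypothetical, and `response` stands for the object returned by `session.get(target_url)`.

```python
def render_changed_markup(response, **render_kwargs):
    """Return True if render() changed the markup held in response.html.

    Assumption: `response` behaves like a requests-html response, i.e.
    `response.html.html` is the current markup as a string and
    `response.html.render()` executes the page's JavaScript.
    """
    before = response.html.html            # markup as downloaded
    response.html.render(**render_kwargs)  # e.g. sleep=5
    after = response.html.html             # markup after rendering
    return before != after
```

With the code above, this would be called as `render_changed_markup(page_html, sleep=5)`. If it returns True while the `content`-based comparison still reports no difference, the rendered markup could be passed to BeautifulSoup from `page_html.html.html` instead.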