HTML from requests not the same as source code

Question

I'm trying to scrape this link: 34th government

(https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp)

which has several tables, but when I perform a request using this code:

import requests
from bs4 import BeautifulSoup

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
website_url = requests.get(govts_url).text
soup = BeautifulSoup(website_url, 'lxml')
print(f"HTML: \n {soup.prettify()}")

I get the following result:

 <html>
 <head>
  <meta charset="utf-8"/>
  <script>
   window.rbzid="Q5gSRBmIWVopQazRgPTWKOEV0wGh1o+KvPO3KMiDuHxM9vVecPeHn4ult+Ba/KU9zInGRSRXUggEmkFs+D5NKSC/WEkCn+B4PCw9CeWkT+Q=";
        u82222.O=function(x){return x;};u82222.E=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.i=function(x,y){return x+y;};u82222.A=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.Y=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.n=function(x,y){return x+y;};u82222.f=function(x,y){return x+y;};u82222.u=function(){var M=function(K,N){var I=N&0xffff;var r=N-I;return(r*K|0)+(I*K|0)|0;},Y=function(x,d,Z){var n=0xcc9e2d51,b=0x1b873593;var E=Z;var O=d&~0x3;for(var w=0;w<O;w+=4){var e=x.charCodeAt(w)&0xff|(x.charCodeAt(w+1)&0xff)<<8|(x.charCodeAt(w+2)&0xff)<<16|(x.charCodeAt(w+3)&0xff)<<24;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;E=(E&0x7ffff)<<13|E>>>19;E=E*5+0xe6546b64|0;}e=0;switch(d%4){case 3:e=(x.charCodeAt(O+2)&0xff)<<16;case 2:e|=(x.charCodeAt(O+1)&0xff)<<8;case 1:e|=x.charCodeAt(O)&0xff;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;}E^=d;E^=E>>>16;E=M(E,0x85ebca6b);E^=E>>>13;E=M(E,0xc2b2ae35);E^=E>>>16;return E;};return{u:Y};}();u82222.d=function(x,y){return x+y;};u82222.K=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.N=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.Z=function(x,y){return x+y;};u82222.I=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.e=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.t=function(){return{u:function(K){var A='',I=decodeURI("1?'%1CYH.=uVWU~%254_hW,o,WKM%22(-W%5BW,o,LU%075?'%1CH%5D9.5LU%07#?'%1C@W,o6LU%07%22?'%1Ch%5C$.%25N%07D1?'%1C%5DW,o4LU%07%124=LU%07%3C.%25N%07F=?'%1COW,o7bAH%3E?'%1CL%5B.=uF@F%3E%02%25N%07v%0F13SG%5D?,:AWU~13LU%07%3E?'%1CvY8%205FFD.=uF%5BF%3C-%25N%07J1?'%1C%5CW,o7");for(var 
Y=0,M=0;Y<I.length;Y++,M++){if(M===K.length){M=0;}A+=String.fromCharCode(I.charCodeAt(Y)^K.charCodeAt(M));}A=A.split('~|.');return function(t){return A[t];};}('PA[2))')};}();u82222.o=function(x,y){return x+y;};u82222.r=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.b=function(x,y){return x+y;};u82222.w=function(x){return x;};u82222.s=function(x,y){return x+y;};u82222.F=function(x,y){return x+y;};u82222.M=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.T=function(x,y){return x>y;};function u82222(){}u82222.x=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};(typeof window==="object"?window:global).u82222=u82222;_=window;if(u82222.w(u82222.O(_[u82222.r(24)+u82222.e(0)+u82222.E(25)+u82222.E(14)+u82222.e(18)])||_[u82222.N(26)]||_[u82222.d(u82222.F(u82222.n(u82222.e(28),u82222.r(30))+u82222.N(20),u82222.N(14)),u82222.r(18))]||_[u82222.x(23)])||_[u82222.b(u82222.x(16),u82222.x(19))+u82222.r(6)+u82222.x(11)]||_[u82222.Z(u82222.E(6)+u82222.x(10)+u82222.e(9),u82222.x(14))]||_[u82222.s(u82222.T(975.11,476.89)?u82222.N(8):(13,105.77),u82222.E(1))+u82222.E(5)+u82222.N(25)]||_[u82222.E(4)]||_[u82222.o(u82222.x(3)+u82222.N(29)+u82222.e(14),u82222.e(15))+u82222.N(10)+u82222.x(7)]||_[u82222.i(u82222.e(2)+u82222.N(18)+u82222.N(12)+u82222.e(13)+u82222.x(22)+u82222.E(15)+u82222.E(25),u82222.e(27))+u82222.E(21)]){}else{location[u82222.f(u82222.r(11)+u82222.x(6)+u82222.e(17)+u82222.N(0),u82222.e(2))]();}
  </script>
 </head>
 <body>
 </body>
</html>

Which is, of course, not the content I want. I guess I'm missing some kind of "activation" step that makes the site serve the real content. How can I get it?

Thx!

Did you check whether the content you're after is dynamically generated? — AMC, Feb 18 '20 at 22:20

score 1 · Accepted Answer · answered Feb 18 '20 at 21:33

I tried it with Selenium (download the driver for your browser; in my case, ChromeDriver) and it works: you get the full HTML source of the page, and from there you can continue with the web scraping. I hope this helps :)

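from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
exe_path = r'C:\Users\JRV\Desktop\WebCrawling/chromedriver.exe'

# Selenium 4 takes the driver path via a Service object;
# Selenium 3 accepted webdriver.Chrome(exe_path) directly.
browser = webdriver.Chrome(service=Service(exe_path))
browser.get(govts_url)  # a real browser runs the page's JavaScript
page = browser.page_source
browser.quit()  # end the session and stop the driver process

soup = BeautifulSoup(page, 'html.parser')
print(f"HTML: \n {soup}")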
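Once you have the rendered HTML (for example from browser.page_source), pulling out the tables could look roughly like the sketch below. The function name extract_tables and the sample markup are hypothetical, made up for illustration; I have not checked the real page's structure. Note that this simple version would repeat rows if the page nests tables inside tables.

```python
from bs4 import BeautifulSoup


def extract_tables(html):
    """Return every <table> as a list of rows, each row a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    tables = []
    for table in soup.find_all("table"):
        rows = []
        for tr in table.find_all("tr"):
            # Header and data cells are treated alike; whitespace is trimmed.
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        tables.append(rows)
    return tables


# Hypothetical sample markup standing in for browser.page_source.
sample = """
<table>
  <tr><th>Column A</th><th>Column B</th></tr>
  <tr><td>1</td><td>Example</td></tr>
</table>
"""
print(extract_tables(sample))
```

With the real page you would pass the Selenium page source instead of sample and then pick the table you need by index.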

score 0 · Answer 2 · answered Feb 18 '20 at 18:59

I believe this could be one of those sites where JavaScript activates the page, in which case you would have to use something like Selenium. Check out this post.
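If you want to confirm that before reaching for a browser, a rough check is to see whether the response contains a script but no visible text in its body. The sketch below is only a heuristic under that assumption; looks_like_js_stub is a made-up helper name, and it uses the standard library parser so it runs without extra packages.

```python
from html.parser import HTMLParser


class _BodyTextCollector(HTMLParser):
    """Collects visible text inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0      # greater than 0 while inside <script>/<style>
        self.script_count = 0
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a reader would actually see.
        if self.in_body and not self.skip_depth and data.strip():
            self.text_parts.append(data.strip())


def looks_like_js_stub(html):
    """Heuristic: True if the page has a script but no visible body text."""
    collector = _BodyTextCollector()
    collector.feed(html)
    collector.close()
    return collector.script_count > 0 and not collector.text_parts
```

You could call it as looks_like_js_stub(requests.get(govts_url).text). A response like the one in the question (one script in the head, empty body) should come back as True, which suggests the content needs JavaScript to load.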

Hedgy
