
I'm trying to scrape this page (34th government):

https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp

It has several tables, but when I send a request using this code:

import requests
from bs4 import BeautifulSoup

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
website_url = requests.get(govts_url).text
soup = BeautifulSoup(website_url, 'lxml')
print(f"HTML: \n {soup.prettify()}")

I get the following result:

 <html>
 <head>
  <meta charset="utf-8"/>
  <script>
   window.rbzid="Q5gSRBmIWVopQazRgPTWKOEV0wGh1o+KvPO3KMiDuHxM9vVecPeHn4ult+Ba/KU9zInGRSRXUggEmkFs+D5NKSC/WEkCn+B4PCw9CeWkT+Q=";
        u82222.O=function(x){return x;};u82222.E=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.i=function(x,y){return x+y;};u82222.A=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.Y=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.n=function(x,y){return x+y;};u82222.f=function(x,y){return x+y;};u82222.u=function(){var M=function(K,N){var I=N&0xffff;var r=N-I;return(r*K|0)+(I*K|0)|0;},Y=function(x,d,Z){var n=0xcc9e2d51,b=0x1b873593;var E=Z;var O=d&~0x3;for(var w=0;w<O;w+=4){var e=x.charCodeAt(w)&0xff|(x.charCodeAt(w+1)&0xff)<<8|(x.charCodeAt(w+2)&0xff)<<16|(x.charCodeAt(w+3)&0xff)<<24;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;E=(E&0x7ffff)<<13|E>>>19;E=E*5+0xe6546b64|0;}e=0;switch(d%4){case 3:e=(x.charCodeAt(O+2)&0xff)<<16;case 2:e|=(x.charCodeAt(O+1)&0xff)<<8;case 1:e|=x.charCodeAt(O)&0xff;e=M(e,n);e=(e&0x1ffff)<<15|e>>>17;e=M(e,b);E^=e;}E^=d;E^=E>>>16;E=M(E,0x85ebca6b);E^=E>>>13;E=M(E,0xc2b2ae35);E^=E>>>16;return E;};return{u:Y};}();u82222.d=function(x,y){return x+y;};u82222.K=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.N=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.Z=function(x,y){return x+y;};u82222.I=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.e=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.t=function(){return{u:function(K){var A='',I=decodeURI("1?'%1CYH.=uVWU~%254_hW,o,WKM%22(-W%5BW,o,LU%075?'%1CH%5D9.5LU%07#?'%1C@W,o6LU%07%22?'%1Ch%5C$.%25N%07D1?'%1C%5DW,o4LU%07%124=LU%07%3C.%25N%07F=?'%1COW,o7bAH%3E?'%1CL%5B.=uF@F%3E%02%25N%07v%0F13SG%5D?,:AWU~13LU%07%3E?'%1CvY8%205FFD.=uF%5BF%3C-%25N%07J1?'%1C%5CW,o7");for(var 
Y=0,M=0;Y<I.length;Y++,M++){if(M===K.length){M=0;}A+=String.fromCharCode(I.charCodeAt(Y)^K.charCodeAt(M));}A=A.split('~|.');return function(t){return A[t];};}('PA[2))')};}();u82222.o=function(x,y){return x+y;};u82222.r=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};u82222.b=function(x,y){return x+y;};u82222.w=function(x){return x;};u82222.s=function(x,y){return x+y;};u82222.F=function(x,y){return x+y;};u82222.M=function (){return typeof u82222.u.u==='function'?u82222.u.u.apply(u82222.u,arguments):u82222.u.u;};u82222.T=function(x,y){return x>y;};function u82222(){}u82222.x=function (){return typeof u82222.t.u==='function'?u82222.t.u.apply(u82222.t,arguments):u82222.t.u;};(typeof window==="object"?window:global).u82222=u82222;_=window;if(u82222.w(u82222.O(_[u82222.r(24)+u82222.e(0)+u82222.E(25)+u82222.E(14)+u82222.e(18)])||_[u82222.N(26)]||_[u82222.d(u82222.F(u82222.n(u82222.e(28),u82222.r(30))+u82222.N(20),u82222.N(14)),u82222.r(18))]||_[u82222.x(23)])||_[u82222.b(u82222.x(16),u82222.x(19))+u82222.r(6)+u82222.x(11)]||_[u82222.Z(u82222.E(6)+u82222.x(10)+u82222.e(9),u82222.x(14))]||_[u82222.s(u82222.T(975.11,476.89)?u82222.N(8):(13,105.77),u82222.E(1))+u82222.E(5)+u82222.N(25)]||_[u82222.E(4)]||_[u82222.o(u82222.x(3)+u82222.N(29)+u82222.e(14),u82222.e(15))+u82222.N(10)+u82222.x(7)]||_[u82222.i(u82222.e(2)+u82222.N(18)+u82222.N(12)+u82222.e(13)+u82222.x(22)+u82222.E(15)+u82222.E(25),u82222.e(27))+u82222.E(21)]){}else{location[u82222.f(u82222.r(11)+u82222.x(6)+u82222.e(17)+u82222.N(0),u82222.e(2))]();}
  </script>
 </head>
 <body>
 </body>
</html>

This is, of course, not the content I want. I guess I'm missing some kind of "activation" step that makes the site serve its real content. How can I get that content?

Thanks!

Guy Barash

2 Answers


I tried it with Selenium (download the driver for your browser; in my case, ChromeDriver) and it works. You get the full HTML source of the page, and from there you can continue with the web scraping. I hope this helps :)

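from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

govts_url = r'https://knesset.gov.il/govt/eng/GovtByNumber_eng.asp'
# Path to the downloaded ChromeDriver executable; adjust for your machine.
exe_path = r'C:\Users\JRV\Desktop\WebCrawling\chromedriver.exe'

# Selenium 4 takes the driver path through a Service object;
# on Selenium 3, use webdriver.Chrome(exe_path) instead.
browser = webdriver.Chrome(service=Service(exe_path))
try:
    # A real browser runs the page's JavaScript, so the full HTML is available.
    browser.get(govts_url)
    page = browser.page_source
finally:
    # quit() ends the whole driver session, not just the current window.
    browser.quit()

soup = BeautifulSoup(page, 'html.parser')
print(f"HTML: \n {soup}")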
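
As a rough sketch of the "continue with the web scraping" step, the tables can be pulled out of the page source as lists of rows. This version uses only the standard library's `html.parser` rather than BeautifulSoup, so it runs without extra installs. The names `TableExtractor` and `extract_tables` are made up for illustration, the sample HTML is placeholder data (not taken from the Knesset page), and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table> seen
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell into single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(html):
    """Return all tables found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Placeholder HTML standing in for browser.page_source.
sample = "<table><tr><th>Name</th><th>Value</th></tr><tr><td>a</td><td> b\n c </td></tr></table>"
tables = extract_tables(sample)
```

With Selenium, `extract_tables(page)` would take the `page` string from above. If BeautifulSoup is already installed, `soup.find_all('table')` is the more usual route to the same data.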

I believe this is one of those sites where JavaScript has to run in the browser before the real page is served. In that case, a plain HTTP request is not enough and you would have to use something like Selenium. Check out this post.
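
One way to check for this case before reaching for a browser is to test whether the response is just a script with an empty body, as in the output shown in the question. The following is a minimal sketch using only the standard library. The helper name `looks_like_js_stub` is hypothetical, and the heuristic (at least one `<script>` tag and no visible body text) is an assumption that will not fit every site.

```python
from html.parser import HTMLParser


class _BodyTextProbe(HTMLParser):
    """Counts <script> tags and collects text that appears inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.in_script = False
        self.script_count = 0
        self.body_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script source is not visible content, so it is skipped.
        if self.in_body and not self.in_script:
            self.body_text.append(data)


def looks_like_js_stub(html):
    """Heuristic: True if the page has scripts but no visible body text."""
    probe = _BodyTextProbe()
    probe.feed(html)
    probe.close()
    visible = "".join(probe.body_text).strip()
    return probe.script_count > 0 and not visible


# Shortened stand-in for the response shown in the question.
stub = "<html><head><script>window.rbzid='x';</script></head><body>\n</body></html>"
is_stub = looks_like_js_stub(stub)
```

If `looks_like_js_stub(requests.get(url).text)` comes back true, the server is most likely waiting for a browser to execute the script, and a tool that runs JavaScript is the next thing to try.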

Hedgy