Can't get JavaScript-generated HTML using Python

Question

I'm trying to write a Python script that automatically extracts the contents of a table from a webpage. I have it working on plain HTML pages, but one website is giving me a headache: its HTML seems to be generated by JavaScript. I tried the dryscrape, selenium and PyQt4 libraries, following examples from several posts, but without success. Every time, I get the HTML as it is before the JavaScript has run, so without the table. I can see the table in the browser, and also in Chrome when I use "Inspect". With "View Page Source" in Chrome the table is not there either, which may be a hint.

The website is the following:

https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231

Here is the code I tried (in every case the output contains no table tags):

Using urllib2:

import urllib2
url="https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
response = urllib2.urlopen(url)
# urlopen returns a response object; read() returns the HTML itself
print response.read()

Using dryscrape:

import dryscrape 
session = dryscrape.Session()
session.visit(url) 
response = session.body()
print response

Using selenium:

from selenium import webdriver
driver = webdriver.Chrome("/usr/lib/chromium/chromedriver")
driver.get(url)
print driver.page_source #page_source fetches page after rendering is complete
driver.quit()

Using PyQt4:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit() 

# This does the magic: it loads everything.
r = Render(url)
# The result is a QString.
result = r.frame.toHtml()
# The QString must be converted to a str before lxml can process it.
formatted_result = str(result.toAscii())
print formatted_result

I would really appreciate it if somebody could help me with this :-)

Cheers

Check this out: http://stackoverflow.com/questions/43423656/trip-advisor-scraping-morelink/43424006#43424006 You may want to try a PhantomJS driver and wait for the JS to load the page content. — elena, May 06 '17 at 07:07
@DeanFenster I posted some code that doesn't work (returns html without the table) — Fleppi, May 06 '17 at 07:50
Thanks for the link @elena! I tried it out but I get the same: html code in return, but no table in it... :-( — Fleppi, May 06 '17 at 07:52

score 1 · Accepted Answer · answered May 07 '17 at 17:47

Use an implicit wait (an explicit wait would also work) so that Selenium keeps retrying for up to 30 seconds until the JavaScript-generated table exists, instead of failing on the first lookup:

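from selenium import webdriver

driver = webdriver.PhantomJS()
url = "https://www.ictax.admin.ch/extern/en.html#/security/CH0008899764/20161231"
driver.get(url)
# From here on, element lookups retry for up to 30 seconds before raising an error.
driver.implicitly_wait(30)
# The table is created by JavaScript, so this lookup waits (up to the limit above) for it.
print(driver.find_element_by_tag_name("table").text)
driver.quit()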

This is the output I am getting:

Titel/Titres/Titoli W Nominell Valoren-Nr. Steuerwert Ertrag / Rendement / Reddito 2016 M Valeur No de Val. imposable Datum / Date Cp. W Brutto KG/KEP zu versteuernder V nominale valeur Val. imposible Data M Brut Ertrag/Rendement Valore Numero di 31.12.2016 ex. zahlb. V lordo imposable/Reddito nominale valore pay. imponible CHF (E) pag. Fr.W. CHF CHF iShares ETF (CH) - iShares SMI (R) (CH), Schweiz
CHF 0.00 889 976 85.31 25.02. 29.02. 36 CHF 0.48
03.03. 07.03. 37 CHF 0.48
11.04. 13.04. 38 CHF 0.70
19.07. 21.07. 40 CHF 0.88
19.07. 21.07. 39 CHF 0.20
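
For the explicit-wait variant mentioned at the top of this answer, here is a minimal sketch of the idea, not a tested scraper for this site. The helper names `wait_for` and `table_text` are hypothetical, and `driver` is assumed to be a Selenium WebDriver; the string "tag name" is the value of Selenium's `By.TAG_NAME` locator strategy.

```python
import time


def wait_for(condition, timeout=30.0, poll=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Raises TimeoutError if nothing truthy comes back within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)


def table_text(driver, url, timeout=30.0):
    """Load url and return the text of the first <table> once one exists.

    Assumption: `driver` behaves like a Selenium WebDriver, i.e. it has
    get(url) and find_elements(by, value).
    """
    driver.get(url)
    # find_elements returns an empty list (falsy) until the JavaScript
    # has inserted at least one <table> into the page.
    tables = wait_for(lambda: driver.find_elements("tag name", "table"), timeout)
    return tables[0].text
```

With Selenium installed, this would be called as `table_text(webdriver.PhantomJS(), url)`. Selenium also ships its own explicit wait, `WebDriverWait` combined with `expected_conditions.presence_of_element_located`, which does the same polling and is probably the better choice in real code.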
