How to get real source code of html page?

Question

Every time I use standard libraries like urllib2, requests, or pycurl, I do not get the full source code. How can I get the full source code as I see it in Chrome, Firefox, etc.? This is what I am trying:

import urllib2

def go_to(link):
    # USER_AGENT, ACCEPT, etc. are constants defined elsewhere in the script.
    headers = {'User-Agent': USER_AGENT,
               'Accept': ACCEPT,
               'Accept-Encoding': ACCEPT_ENCODING,
               'Accept-Language': ACCEPT_LANGUAGE,
               'Cache-Control': CACHE_CONTROL,
               'Connection': CONNECTION,
               'Host': HOST}
    req = urllib2.Request(link, None, headers)
    response = urllib2.urlopen(req)
    # Returns the raw response body; no JavaScript is executed.
    return response.read()

Thank you!

Sorry for my bad English.

UPD: This is the full code as shown in the browser:

 <td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>

This is the incomplete code that my script receives:

<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>

Can you explain in what way what you receive is "not full source code"? (Note that viewing HTML in a browser shows you the browser's *interpretation* of it.) – Alex K., May 14 '14 at 14:43

alecxe · Answer 1 · 2014-05-14T14:48:56.847

2

Because JavaScript and AJAX calls can be involved in building the web page, the only way to be sure you get the same source code as you see in the browser is to use a tool that drives a real browser, such as Selenium:

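from selenium import webdriver

browser = webdriver.Firefox()
browser.get(link)  # link is the URL of the page to fetch

# page_source is the page as the browser currently sees it, after loading
print(browser.page_source)
browser.quit()  # close the browser when done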
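
To see why a plain HTTP fetch differs from the browser's view, it can help to check whether the raw HTML relies on inline scripts to write content. The following is a minimal sketch in Python 3 using only the standard library. `InlineScriptFinder` and `writes_content_with_js` are hypothetical names, and the sample markup is an abridged copy of the snippet from the question. It only looks for `document.write` calls; it does not execute any JavaScript.

```python
from html.parser import HTMLParser


class InlineScriptFinder(HTMLParser):
    """Collect the text of every inline <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def writes_content_with_js(html):
    """Return True if any inline script in html calls document.write."""
    finder = InlineScriptFinder()
    finder.feed(html)
    finder.close()
    return any("document.write" in script for script in finder.scripts)


# Abridged version of the markup the question's script received.
raw = ('<font class="spy14">192.3.10.113<script type="text/javascript">'
       'document.write("<font class=spy2>:<\\/font>"+(Eight7FiveSix^Seven1One))'
       '</script></font>')
print(writes_content_with_js(raw))
```

If this prints True, part of the page (here, the port number) is produced only when a browser runs the script, which is why a plain fetch never contains it.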

edited May 14 '14 at 14:48

answered May 14 '14 at 14:43

alecxe


Thank you, but this script must be fast, and I plan to make it multithreaded. – Valeriy Gaydar May 14 '14 at 14:55
@ValeriyG then you can use a headless browser, e.g. http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs. – alecxe May 14 '14 at 14:56
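
The headless, multithreaded setup discussed in these comments could look roughly like the sketch below. It assumes Python 3, Selenium 4, and a Firefox/geckodriver install, none of which come from the original thread (the linked article uses PhantomJS instead). The function names `make_headless_firefox`, `fetch_rendered_html`, and `fetch_many` are hypothetical. Each page gets its own browser instance, since a single driver should not be shared between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def make_headless_firefox():
    """Create a headless Firefox driver (needs selenium and geckodriver)."""
    # Imported here so the rest of the module loads without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without a visible window
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_firefox):
    """Load url in a fresh browser and return the page source after loading."""
    driver = driver_factory()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process


def fetch_many(urls, driver_factory=make_headless_firefox, workers=4):
    """Fetch several pages in parallel, one browser instance per page."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_rendered_html(u, driver_factory), urls))
```

Starting a browser per page is slow, so for many URLs it may be worth reusing one driver per worker thread. Content loaded by AJAX after the initial page load may also need an explicit wait before reading `page_source`.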

score 0 · Accepted Answer · edited Sep 29 '18 at 18:44

0

The best solution is:

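import sys
from PyQt4.QtGui import *
from PyQt4.QtCore import *
from PyQt4.QtWebKit import *

class Render(QWebPage):
    def __init__(self, url):
        # A QApplication must exist before any other Qt object is created.
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until _loadFinished quits the event loop

    def _loadFinished(self, result):
        # The page has finished loading; keep the frame so its HTML can be read.
        self.frame = self.mainFrame()
        self.app.quit()

url = 'http://webscraping.com'
r = Render(url)
html = r.frame.toHtml()  # HTML of the page as rendered by WebKit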

Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

UPD: The output is a QString. To convert it to a regular Python string, use:

html = r.frame.toHtml().toUtf8().data()

edited Sep 29 '18 at 18:44

Asclepius


answered May 14 '14 at 15:06

Valeriy Gaydar


How do I pass headers with this approach? – venkatadileep Jan 10 '20 at 12:58
@venkatadileep I suppose you should create a QNetworkRequest object, set the headers on it, and pass it to the "load" method as the argument, as described in the official docs. – Valeriy Gaydar Jan 10 '20 at 14:10
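
The QNetworkRequest suggestion above might be sketched as follows. This is an untested illustration, not the accepted answer's code: it assumes PyQt4 is installed, and `encode_headers` and `render_with_headers` are hypothetical helper names. It relies on `QNetworkRequest.setRawHeader` and on the `QWebFrame.load` overload that accepts a `QNetworkRequest`.

```python
import sys


def encode_headers(headers):
    """Turn a dict of text headers into (bytes, bytes) pairs for setRawHeader."""
    return [(name.encode("ascii"), value.encode("utf-8"))
            for name, value in headers.items()]


def render_with_headers(url, headers):
    """Load url in QtWebKit with extra request headers; return the rendered HTML."""
    # PyQt4 is imported here so encode_headers works without it installed.
    from PyQt4.QtCore import QUrl
    from PyQt4.QtGui import QApplication
    from PyQt4.QtNetwork import QNetworkRequest
    from PyQt4.QtWebKit import QWebPage

    # Reuse an existing QApplication if one was already created.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebPage()
    page.loadFinished.connect(lambda ok: app.quit())  # leave the loop once loaded

    request = QNetworkRequest(QUrl(url))
    for name, value in encode_headers(headers):
        request.setRawHeader(name, value)

    page.mainFrame().load(request)  # load() accepts a QNetworkRequest
    app.exec_()  # block until the loadFinished handler quits the loop
    return page.mainFrame().toHtml()
```

A call such as `render_with_headers('http://webscraping.com', {'User-Agent': 'my-agent'})` would then return the rendered HTML. As far as I know, headers set this way apply to the main page request only, not to the images and scripts the page loads afterwards.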
