Whenever I use common libraries such as urllib2, requests, or pycurl, I don't get the full source code of the page. How can I get the full source, as I see it in Chrome, Firefox, etc.? This is what I'm trying:

import urllib2  # Python 2; the header constants below are defined elsewhere in the script

def go_to(link):
    headers = {'User-Agent': USER_AGENT,
               'Accept': ACCEPT,
               'Accept-Encoding': ACCEPT_ENCODING,
               'Accept-Language': ACCEPT_LANGUAGE,
               'Cache-Control': CACHE_CONTROL,
               'Connection': CONNECTION,
               'Host': HOST}
    req = urllib2.Request(link, None, headers)
    response = urllib2.urlopen(req)
    return response.read()

Thank You!

Sorry for my bad english.

UPD: This is the full markup as shown in the browser:

 <td colspan="1"><font class="spy1">1</font> <font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(TwoFiveFiveSix^OneOneSix)+(Zero0FourFour^ZeroSevenSeven)+(TwoFiveFiveSix^OneOneSix)+(TwoFiveFiveSix^OneOneSix))</script><font class="spy2">:</font>8088</font></td>

And this is the incomplete markup that my script receives:

<font class="spy14">192.3.10.113<script type="text/javascript">document.write("<font class=spy2>:<\/font>"+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))</script></font>
shahab
Valeriy Gaydar
  • Can you explain how what you receive is "not full source code"? (Note that looking at HTML in a browser shows you the browser's *interpretation* of it.) – Alex K. May 14 '14 at 14:43
  • What do you get? How does it differ from what you want? – KevinDTimm May 14 '14 at 14:43

2 Answers

2

Since JavaScript and AJAX calls can be involved in building the page, the only way to be sure you get the same source you see in the browser is to use a tool that drives a real browser, such as Selenium:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get(link)

print(browser.page_source)
browser.quit()  # close the browser window when done
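
A variant of the snippet above, as a minimal sketch rather than a definitive implementation: it runs Firefox without a visible window and always closes it afterwards. It assumes Python 3, a recent Selenium 4 release, and a local Firefox installation. The names `make_headless_firefox` and `fetch_rendered_html` are made up for this illustration.

```python
def make_headless_firefox():
    """Start Firefox without a visible window (needs selenium and Firefox installed)."""
    # Imported lazily so the module can be loaded without selenium present.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox's command-line flag for headless mode
    return webdriver.Firefox(options=options)


def fetch_rendered_html(url, make_driver=make_headless_firefox):
    """Load url in a browser and return the markup after scripts have run."""
    driver = make_driver()
    try:
        driver.get(url)            # returns once the page has finished loading
        return driver.page_source  # serialized DOM, including document.write output
    finally:
        driver.quit()              # always release the browser process
```

Usage would look like `html = fetch_rendered_html(link)`. Content that arrives through later AJAX calls may still be missing at that point; an explicit wait would be needed for that case. The browser factory is a parameter so that a different browser (or a stub in tests) can be swapped in.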
alecxe
  • Thank you, but this script must be fast, and I plan to make it multithreaded. – Valeriy Gaydar May 14 '14 at 14:55
  • @ValeriyG then you can make use of a headless browser, e.g. http://www.realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs. – alecxe May 14 '14 at 14:56
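
If starting a browser is too slow, a different technique (not the one proposed in this answer) is to evaluate the obfuscated expression directly. In the snippets from the question, the port is assembled by `document.write` from terms of the form `(A^B)`, where `^` is bitwise XOR. The sketch below assumes the values of those variables are already known; the question does not show where they are defined, so the numbers used here are HYPOTHETICAL and were chosen only so that the result matches the port seen in the browser. `decode_port` is a made-up helper name.

```python
import re

# Matches one "(NameA^NameB)" term inside the document.write(...) call.
XOR_TERM = re.compile(r"\((\w+)\^(\w+)\)")


def decode_port(script_text, variables):
    """Evaluate every (A^B) term and join the results as strings,
    which is what the JavaScript string '+' does."""
    return "".join(str(variables[a] ^ variables[b])
                   for a, b in XOR_TERM.findall(script_text))


# Script text copied from the raw response shown in the question.
script = ('document.write("<font class=spy2>:<\\/font>"'
          '+(Eight7FiveSix^Seven1One)+(FiveZeroTwoOne^Two3Zero)'
          '+(Eight7FiveSix^Seven1One)+(Eight7FiveSix^Seven1One))')

# HYPOTHETICAL values, invented for illustration; the real ones would have
# to be extracted from wherever the page defines these variables.
variables = {"Eight7FiveSix": 12, "Seven1One": 4,
             "FiveZeroTwoOne": 7, "Two3Zero": 7}

port = decode_port(script, variables)
print(port)  # 8088 with the made-up values above
```

This avoids a browser entirely, so it should work well with multithreading, but it is tied to this particular obfuscation scheme and would break if the site changed it.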
0

The best solution I found uses PyQt4's WebKit to render the page, JavaScript included:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  

class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://webscraping.com'  
r = Render(url)  
html = r.frame.toHtml() 

Source: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/

UPD: toHtml() returns a QString. To convert it to a regular Python string, use:

html = r.frame.toHtml().toUtf8().data()
Asclepius
Valeriy Gaydar