
I'm trying to render websites in PyQt whose content is generated by JavaScript. The first site renders without problems and I can scrape the information I need, but when I use the same class to render a second site and retrieve the new data, Python reports that the frame attribute set in the Render class does not exist, even though the same code worked perfectly for the first site. Why is this happening? Am I missing something fundamental in Python? My understanding is that once the first site has been rendered, the object is garbage collected and the second one can be rendered. The code is below:

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

class Render(QWebPage):  
    def __init__(self, url):  
        self.app = QApplication(sys.argv)  
        QWebPage.__init__(self)  
        self.loadFinished.connect(self._loadFinished)  
        self.mainFrame().load(QUrl(url))  
        self.app.exec_()


    def _loadFinished(self, result):  
        self.frame = self.mainFrame()  
        self.app.quit()

urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']

for url in urls:
    r = Render(url)
    result = r.frame.toHtml()
    # Important: toHtml() returns a QString, which must be converted
    # to a Python string before lxml can process it
    formatted_result = str(result)
    #Next build lxml tree from formatted_result
    tree = html.fromstring(formatted_result)
    #Now using correct Xpath we are fetching URL of archives
    archive_links = tree.xpath('//div[@class="campaign"]/a/@href')[1:5]
    print (archive_links)

The error message I'm getting:

  File "javaweb2.py", line 24, in <module>
    result = r.frame.toHtml()
AttributeError: 'Render' object has no attribute 'frame'

Any help would be much appreciated!

PythonTAE

1 Answer


That's because self.frame is only defined once self._loadFinished() has been called, which only happens when the QWebPage instance emits its loadFinished signal. Setting aside several dubious practices in the code you posted, the following would solve the issue (note: the line marked with ***** is the important one):

import time  # needed for time.sleep() in the polling loop below

class Render(QWebPage):  
    def __init__(self, url):  
        self.app = QApplication(sys.argv)  
        self.frame = None  # *****
        QWebPage.__init__(self)  
        self.loadFinished.connect(self._loadFinished)  
        self.mainFrame().load(QUrl(url))  
        self.app.exec_()

    def _loadFinished(self, result):  
        self.frame = self.mainFrame()  
        self.app.quit()

urls = ['http://pycoders.com/archive/', 'http://us4.campaign-archive2.com/home/?u=9735795484d2e4c204da82a29&id=64134e0a27']

for url in urls:
    r = Render(url)
    # wait till frame arrives: 
    while r.frame is None:
        # pass  # option 1: works, but will cause 100% cpu 
        time.sleep(0.1)  # option 2: much better

    result = r.frame.toHtml()
    ...

So the "pass" option would work, but it consumes 100% of a CPU core because the loop spins as fast as the interpreter can run it. Sleeping for 0.1 seconds between checks polls only ten times per second and uses very little CPU.
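A polling loop like the one above never ends if the frame never arrives. As a rough sketch only, the wait can be given a deadline. The helper name wait_until and the FakeRender class are hypothetical stand-ins made up for illustration (no Qt is involved, so the snippet runs anywhere), and it assumes Python 3 for time.monotonic():

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it is true or `timeout` seconds have passed.

    Returns True if the predicate became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)  # sleep between checks to keep CPU use low
    return True


class FakeRender:
    """Hypothetical stand-in for Render: frame starts out as None."""

    def __init__(self):
        self.frame = None


r = FakeRender()
# Nothing ever sets r.frame here, so the wait gives up after the timeout.
ok = wait_until(lambda: r.frame is not None, timeout=0.3, interval=0.1)
print(ok)  # False
```

The timeout matters in a Qt program because time.sleep() does not run the Qt event loop: if the signal has not been delivered by the time the loop starts, sleeping will not make it arrive, and a loop without a deadline would hang.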

The best solution, of course, is to put the logic that depends on the frame being available (i.e. the code currently in the URL loop below r = Render(url)) in a function that is called when the loadFinished signal is emitted. Since you can't control the order in which signals arrive, the best option is to move that code into the _loadFinished() method.
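To illustrate the callback-driven structure described above, here is a minimal Qt-free sketch. Signal, FakePage and Scraper are hypothetical names invented for this example and are not part of PyQt; in the real code, the page's loadFinished signal and self.mainFrame().load(...) would play the roles of the stand-ins. The point is only the shape: everything that needs the loaded page lives in the finished handler, which also starts the next load.

```python
class Signal:
    """Tiny stand-in for a Qt signal (hypothetical, not the PyQt API)."""

    def __init__(self):
        self._slots = []

    def connect(self, slot):
        self._slots.append(slot)

    def emit(self, *args):
        for slot in list(self._slots):
            slot(*args)


class FakePage:
    """Hypothetical page: load() 'finishes' at once and emits loadFinished."""

    def __init__(self):
        self.loadFinished = Signal()
        self.html = None

    def load(self, url):
        self.html = "<html>%s</html>" % url  # pretend page content
        self.loadFinished.emit(True)


class Scraper:
    """Handles URLs one after another, driven by the finished handler."""

    def __init__(self, page, urls):
        self.page = page
        self.pending = list(urls)
        self.results = []
        self.page.loadFinished.connect(self._load_finished)

    def start(self):
        self._load_next()

    def _load_next(self):
        if self.pending:
            self.page.load(self.pending.pop(0))

    def _load_finished(self, ok):
        # All code that depends on the loaded page goes here.
        if ok:
            self.results.append(self.page.html)
        self._load_next()  # move on to the next URL, if any


scraper = Scraper(FakePage(), ["http://a.example/", "http://b.example/"])
scraper.start()
print(len(scraper.results))  # 2
```

With this shape no code outside the handler ever reads the page before it is ready, so the polling loop is not needed at all.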

Oliver
  • I put the suggested code into the _loadFinished() method and am calling the class from inside a main function. It works fine with one URL, but as soon as I want to render two websites, one after the other, it hangs in the render object of the first website. It seems I have to somehow jump out of the Render class (leave the scope of the event loop) to continue with the second website. Is there a way to do this? Using exit() just quits the program. Maybe the Python app has to close and reopen to render the next page, which is impossible because the app would have to be reopened in the terminal? – PythonTAE Feb 08 '16 at 12:16
  • Can you post a separate question with the code as you have fixed it? Thanks. – Oliver Feb 08 '16 at 17:26
  • Here is the separate question: http://stackoverflow.com/questions/35311673/how-to-scrape-several-websites-with-pyqt4-scope-change – PythonTAE Feb 11 '16 at 18:12