I have the following class, which returns the HTML of a given web page:

from PyQt4.QtCore import QUrl, SIGNAL
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

from bs4 import BeautifulSoup
from bs4.dammit import UnicodeDammit
import sys
import signal


class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.html = None
        signal.signal(signal.SIGINT, signal.SIG_DFL)
        self.connect(self, SIGNAL('loadFinished(bool)'), self._finished_loading)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _finished_loading(self, result):
        self.html = self.mainFrame().toHtml()
        self.soup = BeautifulSoup(UnicodeDammit(self.html).unicode_markup)
        self.app.quit()

I also have a loop that iterates over a list of web pages containing JavaScript that needs to run, such as:

l = ["http://host.com/page1", "http://host.com/page2"]

for page in l:
    soup = Render(page).soup
    # Do something

The problem is that the JavaScript is only executed on the first page that is loaded; it is not interpreted on any page after that.

Bakuriu
  • That's probably because `QWebPage` doesn't wait for javascript execution before emitting the `loadFinished` signal –  Feb 08 '13 at 17:23
  • Shouldn't the first run also fail to execute it then? – user2055077 Feb 08 '13 at 18:42
  • Not necessarily, it probably loads fast enough on your first page, or perhaps the javascript isn't completely rendered –  Feb 08 '13 at 18:54
  • @X.Jacobs I don't think `loadFinished` is the reason. Since qt documentation states: The `loadStarted()` signal is emitted when the view begins loading. The `loadProgress()` signal, on the other hand, is emitted whenever an element of the web view completes loading, such as an embedded image, a **script**, etc. Finally, the `loadFinished()` signal is emitted when the view has loaded completely. – nymk Feb 08 '13 at 19:08
  • @nymk Yes, the script may be loaded, but not rendered –  Feb 08 '13 at 19:25
  • @user2055077, I have a question based on your code. I'm using your Render class; it works fine the first time, but whenever I call it a second time, it gives me this error: `QObject::connect: Cannot connect (null)::configurationAdded` – Marcelo Assis Feb 19 '14 at 21:24

1 Answer

The page has probably loaded successfully, but it contains more than one frame. More precisely, page.mainFrame().childFrames() is sometimes not empty. You need to process not only the main frame but also its child frames.
For example:

def _finished_loading(self, result):
    self.html = self.mainFrame().toHtml()
    self.soup = BeautifulSoup(UnicodeDammit(self.html).unicode_markup)
    # process childFrames
    self.htmls = [frame.toHtml() for frame in self.mainFrame().childFrames()]
    self.soups = [BeautifulSoup(UnicodeDammit(html).unicode_markup) for html in self.htmls]
    self.app.quit()
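
If the child frames can themselves contain frames, a one-level list comprehension would miss the nested ones. Here is a rough sketch of a recursive walk, assuming each frame object exposes toHtml() and childFrames() the way QWebFrame does. The helper name collect_frame_html and the FakeFrame stand-in are hypothetical; FakeFrame is only there so the sketch runs without Qt installed.

```python
def collect_frame_html(frame):
    """Return the HTML of `frame` followed by that of all its descendants,
    depth-first. Works on any object with toHtml() and childFrames()."""
    htmls = [frame.toHtml()]
    for child in frame.childFrames():
        htmls.extend(collect_frame_html(child))
    return htmls


class FakeFrame(object):
    """Hypothetical stand-in for QWebFrame, used only for this demo."""

    def __init__(self, html, children=()):
        self._html = html
        self._children = list(children)

    def toHtml(self):
        return self._html

    def childFrames(self):
        return self._children


# A main frame with two children, one of which has a nested frame.
main = FakeFrame("<main/>", [
    FakeFrame("<a/>", [FakeFrame("<a1/>")]),
    FakeFrame("<b/>"),
])
print(collect_frame_html(main))  # ['<main/>', '<a/>', '<a1/>', '<b/>']
```

Inside _finished_loading you could then call collect_frame_html(self.mainFrame()) and build one BeautifulSoup object per returned string, in the same way as above.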
nymk