I have a PyQt5 scraper that has to render a webpage before scraping it, since the webpage has dynamic data. This is the most barebones version of the script, which unfortunately still has several parts.

The only reason the render is called from a function is that it sometimes hangs indefinitely, so the function is wrapped in a thread-based timeout. That is all well and good, except that Render won't work properly inside a function, apparently because the QApplication isn't passed in correctly. I can define APP = QApplication(sys.argv) and put the Render class inside the ScrapeClockwise function, but that requires defining APP within that function as well (it can't be passed in, for some reason). Then, if the function times out, it exits without closing the QApplication, so the next time the function runs the program simply crashes. This happens even if it is defined inside a try-except statement, which is extra weird.

As you can see, there are a lot of strange interactions here. If anyone could shed some light on any of them I would be incredibly thankful; I've been beating my head against this for a while now.

import sys
from PyQt5.QtCore import *
from PyQt5.QtWebKitWidgets import *
from PyQt5.QtWidgets import *
from bs4 import BeautifulSoup
import threading
import functools
from threading import Thread

def timeout(timeout):
    def deco(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            res = [Exception('function [%s] timeout [%s seconds] exceeded!' % (func.__name__, timeout))]

            def newFunc():
                try:
                    res[0] = func(*args, **kwargs)
                except Exception as e:
                    res[0] = e

            t = Thread(target=newFunc)
            t.daemon = True
            try:
                t.start()
                t.join(timeout)
            except Exception as je:
                print('error starting thread')
                raise je
            ret = res[0]
            if isinstance(ret, BaseException):
                raise ret
            return ret

        return wrapper

    return deco

APP = QApplication(sys.argv)

class SomeClass(QWidget):
    def some_method(self):
        APP.processEvents(
            QEventLoop.ExcludeUserInputEvents
            | QEventLoop.ExcludeSocketNotifiers
            | QEventLoop.WaitForMoreEvents
        )

class Render(QWebPage):
    def __init__(self, url):
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        APP.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        APP.quit()

def ScrapeClockwise(l):
    url = "https://www.clockwisemd.com/hospitals/" + str(l).zfill(4) + "/appointments/new"
    print(url)
    r = Render(url)
    result = r.frame.toHtml()
    soup = BeautifulSoup(result, 'html.parser')
    info = soup.find_all('h4')
    for i in info:
        print(i.get_text())

l = 0
while True:
    func = timeout(5)(ScrapeClockwise)
    try:
        func(str(l))
    except Exception as e:
        print(e)
        pass  # handle errors here
    l += 1
  • Starting a QApplication within the init of a Qt class is not a good idea, and there's also no need to keep creating a new app each time. I'd suggest you rethink the whole concept and use the loadFinished signal to handle the result, so that scraping restarts after the HTML is processed (possibly using QTimer). – musicamante Apr 17 '20 at 20:39

1 Answer

Every technology has its limitations, and in the case of Qt you cannot use a QWebPage in a secondary thread. It also helps to understand how the technology works: many Qt elements need and use an event loop, and that can be part of the solution. In this case, a QTimer can be used to measure the elapsed time and, if the timeout fires, load a new page.
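Apart from the Qt restriction, the thread-based timeout in the question has a second issue: `Thread.join(timeout)` only stops waiting, it does not stop the worker. The following is a minimal sketch in plain Python (no Qt involved; `slow_task` is a hypothetical stand-in for a render that takes too long) showing that the thread keeps running after the timeout has expired:

```python
import threading
import time


def slow_task(done_flag):
    # Hypothetical stand-in for a page render that outlasts the timeout.
    time.sleep(0.5)
    done_flag.set()


done = threading.Event()
worker = threading.Thread(target=slow_task, args=(done,), daemon=True)
worker.start()

# Give up waiting after 50 ms, as the timeout decorator does with t.join(timeout).
worker.join(0.05)
still_running_after_timeout = worker.is_alive()

# Now wait without a timeout: the task was never cancelled, so it completes.
worker.join()
finished_anyway = done.is_set()

print(still_running_after_timeout, finished_anyway)
```

So when the decorator raises its timeout exception, the old call is still alive in the background, which is consistent with the next call finding the application in an unexpected state.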

Taking the above into account, I modified the code from this question to obtain the following solution:

from PyQt5 import QtCore, QtWidgets, QtWebKitWidgets

from bs4 import BeautifulSoup


def create_urls():
    l = 0
    while True:
        yield "https://www.clockwisemd.com/hospitals/{:04d}/appointments/new".format(l)
        l += 1


class WebPage(QtWebKitWidgets.QWebPage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.mainFrame().loadFinished.connect(self.handleLoadFinished)
        self.mainFrame().urlChanged.connect(print)

        self.timer = QtCore.QTimer(
            singleShot=True, interval=10 * 1000, timeout=self.on_timeout
        )

    def start(self, generator):
        self.generator = generator
        self.fetchNext()

    def fetchNext(self):
        url = next(self.generator)
        self.mainFrame().load(QtCore.QUrl(url))
        self.timer.start()

    def processCurrentPage(self):
        html = self.mainFrame().toHtml()
        print("[url]: {}".format(self.mainFrame().url().toString()))

        soup = BeautifulSoup(html, "html.parser")
        info = soup.find_all("h4")
        for i in info:
            print(i.get_text())

    def on_timeout(self):
        print("[Timeout]")
        self.fetchNext()

    def handleLoadFinished(self):
        if self.timer.isActive():
            self.timer.blockSignals(True)
            self.timer.stop()
            self.timer.blockSignals(False)
        self.processCurrentPage()
        self.fetchNext()


if __name__ == "__main__":
    import sys

    app = QtWidgets.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(create_urls())
    sys.exit(app.exec_())
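Note that `create_urls()` never ends, so the script keeps going until it is interrupted. If only a fixed number of pages is wanted, one possible variation (an assumption on my part, not something the question asked for) is to cap the generator with `itertools.islice` and use `next(generator, None)` to detect when it is exhausted. The helper names `limited_urls` and `next_url_or_none` are hypothetical:

```python
from itertools import islice


def create_urls():
    l = 0
    while True:
        yield "https://www.clockwisemd.com/hospitals/{:04d}/appointments/new".format(l)
        l += 1


def limited_urls(count):
    # Hypothetical helper: stop after `count` URLs instead of running forever.
    return islice(create_urls(), count)


def next_url_or_none(generator):
    # Returns None when the generator is exhausted instead of raising StopIteration.
    return next(generator, None)


urls = limited_urls(2)
first = next_url_or_none(urls)
second = next_url_or_none(urls)
third = next_url_or_none(urls)
print(first, second, third)
```

In `fetchNext` one could then check for `None` and call `QtWidgets.QApplication.quit()` instead of loading another page.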
eyllanesc