I'm trying to write a web scraper using PyQt5 and multithreading so that I can scrape multiple URLs in parallel. (I'm aware of "Scrape multiple urls using QWebPage", but I really want to write a parallel version and can't see why it doesn't work.) I've written this code:

import sys
from PyQt5.QtGui import *
from PyQt5.QtWidgets import *
from PyQt5.QtCore import *

from PyQt5.QtWebEngineWidgets import QWebEnginePage

import time

urlb = "https://www.google.fr/"


class Worker(QRunnable, QWebEnginePage):
    '''
    Worker thread
    '''
    def __init__(self, url):
        super(Worker, self).__init__()
        self.url = url
    
    def _on_load_finished(self):
        print("tfouuu")
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
    
    @pyqtSlot()
    def run(self):
        print("a") 
        time.sleep(2)
        print(self.url)
        print("b")
        QWebEnginePage.__init__(self)
        print("c")
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        print("d")

class MainWindow(QMainWindow):


    def __init__(self, *args, **kwargs):
        
        self.threadpool = QThreadPool()
        print("Multithreading with maximum %d threads" % self.threadpool.maxThreadCount())
        
        super(MainWindow, self).__init__(*args, **kwargs)
        
        worker = Worker(urlb)
        worker2 = Worker(urlb)
        self.threadpool.start(worker)
        self.threadpool.start(worker2)


    
    
app = QApplication([])
window = MainWindow()
app.exec_()

But I have two problems:

  • The first is that my code keeps running without stopping (I guess it has to do with the missing app.quit() call, but I don't know where to put it).

  • The second, and the more important one, is that my code only prints 'a', 'b', 'c': it never runs the connect and load part.

John924734
  • Okay, first off, Python does not multi-process unless you specifically add multiprocessing to it, so you will need to look into that. And no, threading is not multiprocessing; it is just multithreading, a totally different concept, in case you were thinking that was what threading was supposed to do. – Dennis Jensen Jul 26 '19 at 13:48
  • 1) Why do you need multithreading in your case? If you want multithreading for its own sake, then unfortunately you will not be able to do it, since QWebEnginePage cannot live in another thread. If your goal is to execute several requests without waiting for the previous request to finish, then another solution can be proposed. – eyllanesc Jul 26 '19 at 22:48
  • 2) Multithreading is not the same as parallelism; both, together with multiprocessing, are concurrency techniques. What do these techniques have in common? They aim to execute several tasks at the same time, but each approaches it from a different perspective, with its own advantages, disadvantages and requirements. Qt only supports multithreading, but not all classes can run in a different thread, and QWebEnginePage is one of those that cannot. – eyllanesc Jul 26 '19 at 22:48
  • Thank you @eyllanesc for your answer, I understand. What's the other solution you're thinking of to execute several requests at the same time? – John924734 Jul 29 '19 at 07:28
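
To make the distinction drawn in the comments above a bit more concrete, here is a minimal sketch that uses only the Python standard library, not Qt. The names `fake_download` and `run_in_threads` are hypothetical, and `time.sleep` merely stands in for a blocking network call. Threads overlap the waiting, so the four waits should take roughly as long as one, but this is concurrency on I/O waits rather than CPU parallelism, and it does not lift the restriction that QWebEnginePage must stay in the main thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Hypothetical stand-in for a blocking network request."""
    time.sleep(delay)  # sleeping releases the GIL, much like real blocking I/O
    return delay


def run_in_threads(delays):
    """Run fake_download for each delay in a thread pool.

    Returns the results (in input order) and the elapsed wall-clock time.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        results = list(pool.map(fake_download, delays))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = run_in_threads([0.2, 0.2, 0.2, 0.2])
    print(results, "in %.2f s" % elapsed)
```

For CPU-bound work, `ProcessPoolExecutor` from the same module would be the multiprocessing counterpart; neither applies directly to Qt WebEngine pages.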

1 Answer

QWebEnginePage, like QWebEngineView, cannot and should not run in another thread.

Instead, if you want to get the HTML asynchronously, you should use Qt signals:

from functools import partial
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets


class WebManager(QtCore.QObject):
    def __init__(self, parent=None):
        super(WebManager, self).__init__(parent)
        self.pages = []
        self.results = []

    def load(self, url):
        page = QtWebEngineWidgets.QWebEnginePage(self)
        page.loadFinished.connect(self._on_load_finished)
        self.pages.append(page)
        page.load(QtCore.QUrl(url))

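    @QtCore.pyqtSlot(bool)
    def _on_load_finished(self, ok):
        # sender() is the page whose loadFinished signal triggered this slot
        page = self.sender()
        if not isinstance(page, QtWebEngineWidgets.QWebEnginePage):
            return
        if ok:
            # toHtml() is asynchronous: bind the page so the callback knows its origin
            wrapper = partial(self.callable, page)
            page.toHtml(wrapper)
        else:
            # Load failed: drop the page without recording a result
            self.pages.remove(page)
            page.deleteLater()
            # Still quit if this was the last pending page
            if not self.pages:
                QtWidgets.QApplication.quit()

    def callable(self, page, html):
        self.pages.remove(page)
        url = page.requestedUrl().toString()
        page.deleteLater()
        self.results.append((url, html))
        # Quit the event loop once every pending page has been handled
        if not self.pages:
            QtWidgets.QApplication.quit()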


if __name__ == "__main__":
    import sys

    app = QtWidgets.QApplication(sys.argv)

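    manager = WebManager()

    format_url = "http://pyqt.sourceforge.net/Docs/PyQt5/%s.html"
    # Queue one page per Q-prefixed name; the loads proceed concurrently in the event loop
    for name in dir(QtWebEngineWidgets):
        if name.startswith("Q"):
            url = format_url % name.lower()
            manager.load(url)
    # Blocks until WebManager calls QApplication.quit()
    app.exec_()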
    for url, html in manager.results:
        print(url)
        print(html)
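
The two ideas the code above relies on, binding the page to the callback with `functools.partial` and quitting once the list of pending pages is empty, can be tried without Qt. The following is only a sketch: `FakePage`, `Collector`, `to_html`, `fetch_all` and `on_html` are hypothetical names, and `FakePage.to_html` calls back immediately, whereas the real `toHtml()` calls back later from the event loop.

```python
from functools import partial


class FakePage:
    """Hypothetical stand-in for QWebEnginePage; not part of PyQt5."""

    def __init__(self, url):
        self.url = url

    def to_html(self, callback):
        # Like toHtml(), pass a single argument (the HTML string) to the callback.
        callback("<html>%s</html>" % self.url)


class Collector:
    def __init__(self):
        self.pages = []       # pages still waiting for their HTML
        self.results = []     # (url, html) pairs collected so far
        self.finished = False

    def load(self, url):
        self.pages.append(FakePage(url))

    def fetch_all(self):
        # Iterate over a copy because on_html() removes items from self.pages.
        for page in list(self.pages):
            # partial() pre-fills the first argument, so the callback
            # receives (page, html) although to_html() only supplies html.
            page.to_html(partial(self.on_html, page))

    def on_html(self, page, html):
        self.pages.remove(page)
        self.results.append((page.url, html))
        if not self.pages:
            # In the Qt version this is where QApplication.quit() is called.
            self.finished = True
```

Calling `load()` a few times and then `fetch_all()` leaves one `(url, html)` pair per page in `results` and sets `finished` only after the last callback has run, which is the same point at which the Qt version leaves `app.exec_()`.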
eyllanesc