
I'm following Sentdex's PyQt4 YouTube tutorial, but trying to use PyQt5 instead. It's a simple web scraping app. I followed along with Sentdex's tutorial and got as far as the PyQt4 version of the scraper.

Now I'm trying to write the same application with PyQt5 and this is what I have:

import os
import sys
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, QEventLoop
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from bs4 import BeautifulSoup
import requests


class Client(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self):
        self.app.quit()


url = 'https://pythonprogramming.net/parsememcparseface/'
client_response = Client(url)

#I think the issue is here at LINE 26
source = client_response.mainFrame().toHtml()

soup = BeautifulSoup(source, "html.parser")
js_test = soup.find('p', class_='jstest')
print(js_test.text)

When I run this, I get the message:

source = client_response.mainFrame().toHtml()
AttributeError: 'Client' object has no attribute 'mainFrame'

I've tried a few different solutions but none work. Any help would be appreciated.

EDIT

Logging QUrl(url) on line 15 returns this value:

PyQt5.QtCore.QUrl('https://pythonprogramming.net/parsememcparseface/')

When I try source = client_response.load(QUrl(url)) for line 26, I end up with the message:

File "test3.py", line 28, in <module>
    soup = BeautifulSoup(source, "html.parser")
  File "/Users/MYNAME/.venv/qtproject/lib/python3.6/site-packages/bs4/__init__.py", line 192, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()

When I try source = client_response.url() I get:

soup = BeautifulSoup(source, "html.parser")
      File "/Users/MYNAME/.venv/qtproject/lib/python3.6/site-packages/bs4/__init__.py", line 192, in __init__
        elif len(markup) <= 256 and (
    TypeError: object of type 'QUrl' has no len()
Les Paul
  • It looks like mainFrame() is a self defined method within `class Client(QWebEnginePage)`, because it doesn't exist in the class according to the [Qt5 Documentation](http://doc.qt.io/qt-5/qwebenginepage.html). Are you sure there isn't more to the tutorial that you are missing? – NineTails Feb 09 '17 at 22:07
  • mainFrame() was a method in PyQt4 with QWebPage: http://doc.qt.io/qt-5/qtwebenginewidgets-qtwebkitportingguide.html – Les Paul Feb 09 '17 at 22:09
  • Without knowing much about the webkit it seems that mainFrame() has been absorbed to other functions, where instead you specify whether the frame is the main one or a child frame by using a bool indicator. For example `acceptNavigationRequest(const QUrl &url, NavigationType type, bool isMainFrame)`. – NineTails Feb 09 '17 at 22:16
  • I could have guessed what you are saying, but as someone who isn't an expert especially in the realm of Python, knowing WHY something doesn't work doesn't help much if I don't know HOW to fix it ---- with code samples. – Les Paul Feb 09 '17 at 22:21
  • @LesPaul. `QtWebEngine` is not a drop-in replacement for `QtWebKit` - there are many features that have changed, or are completely missing. – ekhumoro Feb 09 '17 at 23:05

2 Answers


You must call QWebEnginePage::toHtml() from inside the class definition, once the page has finished loading. toHtml() is asynchronous: it takes a callable (a bound method or a lambda) as its parameter, and that callable must in turn accept one parameter of type 'str', which receives the page's HTML. Sample code is below.

import bs4 as bs
import sys
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()  # blocks until Callable() calls quit()

    def _on_load_finished(self):
        # toHtml() returns nothing; the HTML is delivered to the callback
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('https://pythonprogramming.net/parsememcparseface/')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('p', class_='jstest')
    print(js_test.text)

if __name__ == '__main__': main()
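
To see why the HTML arrives through the callback and not as a return value, here is a Qt-free toy that mimics only the calling convention. `FakePage` and `process_events` are hypothetical stand-ins written for this illustration, not part of PyQt5; the real page delivers the string from the Qt event loop.

```python
class FakePage:
    """Hypothetical stand-in that copies only the callback contract of toHtml()."""

    def __init__(self, html):
        self._html = html
        self._pending = []

    def toHtml(self, callback):
        # Like the real method: remember the callback, return nothing.
        self._pending.append(callback)

    def process_events(self):
        # Stand-in for the event loop handing over the result later.
        while self._pending:
            self._pending.pop(0)(self._html)


received = []
page = FakePage('<p class="jstest">hello</p>')  # made-up markup

returned = page.toHtml(received.append)
print(returned)   # None: nothing useful comes back from the call itself
print(received)   # []: the callback has not run yet

page.process_events()
print(received)   # ['<p class="jstest">hello</p>']
```

This is also why assigning the result of toHtml() to a variable only ever stores None: the string exists only inside the callback, so that is where it has to be saved.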
Simon
Ayanda Khanyile
  • This works fine if I need to get just one page. But if I create a loop where a page is downloaded each cycle of the loop, Python crashes. Any idea what to do? There isn't any exception or error message - Python itself crashes and OSX offers to submit an error report – Rahul Iyer Jan 17 '18 at 09:29
  • @KaizerSozay I have the same problem, did you find any solution? – Viktor Mar 18 '19 at 07:30
  • No idea. Can’t remember what I did – Rahul Iyer Mar 18 '19 at 07:31
  • How does html_str contains html data? – Abhay May 05 '19 at 17:13
  • I tried running this on an event page at Carnegie Hall and I got the pre-rendered version with the script and not the script's results. https://www.carnegiehall.org/calendar/2022/09/29/Carnegie-Halls-Opening-Night-Gala-The-Philadelphia-Orchestra-0700PM – Buzzy Hopewell Aug 12 '22 at 23:21

Never too late... I had the same issue (Python crashing when pages are fetched in a loop) and found a description of it here: http://pyqt.sourceforge.net/Docs/PyQt5/gotchas.html#crashes-on-exit

I followed the advice of putting the QApplication in a global variable (I know it is dirty... and I will be punished for that) and it works "fine". I can loop without any crash.

Hope this will help.