
I started using PyQt5 to scrape pages that render their content with JavaScript, and I've already run into trouble. When I have multiple URLs to scrape, Python crashes after the first or second URL, regardless of the domain. I can get data from the first page, but not from the second. The Windows error logs show Qt5WebEngineCore.dll as the cause of the error, but I have no clue what to do about it, and I couldn't find anything useful elsewhere on the web. Here's the code:

import sys
import requests
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from bs4 import BeautifulSoup
import re


class Client(QWebEnginePage):

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))    # Unlike PyQt4, there is no mainFrame() to go through
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def scrape_pyqt5():
    lists = ['example1.com/a', 'example1.com/b', 'example1.com/c']
    for url in lists:
        r = Client(url)
        bs = BeautifulSoup(r.html, 'html.parser')
        for link in bs.find_all('div', {'id': 'media-player'}):
            for directlink in link.find_all('iframe'):
                print(directlink)
eyllanesc
murdock477
  • Show your code. – eyllanesc Dec 28 '17 at 19:26
  • I added it to the main question. – murdock477 Dec 28 '17 at 19:36
  • I think it's because of this line: `self.load(QUrl(url))`. The object gets garbage collected. Pull it out of the function and don't make it disappear so python thinks it is needed. –  Dec 28 '17 at 19:42
  • Okay, but what exactly do you mean by "don't make it disappear"? – murdock477 Dec 28 '17 at 19:47
  • @murdock477 I mean, do something with that instance, like `self.url = QUrl(url); self.load(self.url)`. I might be utterly wrong, though. It's just an assumption, hence a comment. By "don't make it disappear" I mean: pretend you need the object so that it never goes out of scope and gets GC'ed. –  Dec 28 '17 at 19:55
  • Thank you, but it didn't seem to work. – murdock477 Dec 28 '17 at 20:04
  • Okay, I thought I understood what you said (I didn't), but then I realized I'm creating the object again and again every time I loop to another URL! I'll try to figure something out now. – murdock477 Dec 28 '17 at 20:18
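
For anyone landing here, a minimal sketch of where that last comment points. It assumes the crash comes from building a new `QApplication` (and a new page) for every URL, so it keeps a single application and a single `QWebEnginePage` and feeds the URLs through them one at a time. The names `with_scheme` and `render_pages` are made up for illustration, and the `https` default is an assumption: the URLs in the question have no scheme, and `QUrl` would treat them as relative.

```python
import sys
from collections import deque


def with_scheme(url, default="https"):
    """Prefix a scheme when it is missing ('example1.com/a' has none)."""
    return url if "://" in url else f"{default}://{url}"


def render_pages(urls):
    """Render each URL in turn using one QApplication and one page."""
    pending = deque(with_scheme(u) for u in urls)
    if not pending:
        return {}

    # Imported here so with_scheme() stays usable without Qt installed.
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEnginePage

    # Reuse an existing application if there is one; never create a second.
    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    results = {}

    def load_next():
        if pending:
            page.load(QUrl(pending[0]))
        else:
            app.quit()  # all URLs handled, leave the event loop

    def on_html(html):
        # Store the markup for the URL at the head of the queue, then move on.
        results[pending.popleft()] = html
        load_next()

    def on_load_finished(ok):
        # toHtml() is asynchronous: it passes the markup to the callback later.
        if ok:
            page.toHtml(on_html)
        else:
            on_html("")  # record failed loads as empty strings

    page.loadFinished.connect(on_load_finished)
    load_next()
    app.exec_()  # runs once, until load_next() calls app.quit()
    return results


if __name__ == "__main__":
    for url, html in render_pages(["example1.com/a", "example1.com/b"]).items():
        print(url, len(html))
```

The returned dictionary maps each URL to its rendered HTML, so the BeautifulSoup loop from the question can run over `results.values()` after the event loop has finished. I have not tested this against the sites in the question.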

0 Answers