
I am new to Python and am trying to scrape some data from a JavaScript-rendered web page with the second one on this. When I run this code in a for loop, it returns only 2 results from a list of 50 items and then fails with "Process finished with exit code -1073740940 (0xC0000374)". Can anyone explain the reason, please?

My sample is here:

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        self.html = self.toHtml(self.Callable)

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    global linklist
    for iurl in linklist:
        page = Page(iurl)
        soup = bs.BeautifulSoup(page.html, 'html.parser')
        data = soup.find('div', class_='tablo_dual_board')
        data = data.text
        data = data.splitlines()
        print(data)

I've also tried this, and it gives a result only for the first list item. Is there any way other than these to apply a function to each list item?

for iurl in linklist:
    iurl=main()

My whole code is here:

import sys
from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets
import requests
from bs4 import BeautifulSoup
import bs4 as bs


class WebPage(QtWebEngineWidgets.QWebEnginePage):
    def __init__(self):
        super(WebPage, self).__init__()
        self.loadFinished.connect(self.handleLoadFinished)

    def start(self, urls):
        self._urls = iter(urls)
        self.fetchNext()

    def fetchNext(self):
        # load the next url; return False once the iterator is exhausted
        try:
            url = next(self._urls)
        except StopIteration:
            return False
        else:
            self.load(QtCore.QUrl(url))
        return True

    def processCurrentPage(self, html):
        url = self.url().toString()
        # do stuff with html...
        soup = bs.BeautifulSoup(html, 'html.parser')
        veri = soup.find('div', class_='tablo_dual_board')
        # find() returns None when the div is missing from the page
        if veri is not None:
            veri = veri.text
            veri = veri.splitlines()
            print(veri)
        if not self.fetchNext():
            QtWidgets.qApp.quit()

    def handleLoadFinished(self):
        self.toHtml(self.processCurrentPage)

    def javaScriptConsoleMessage(self, *args):
        # disable javascript error output
        pass

if __name__ == '__main__':

    # generate some test urls

    onexurl = "https://1xbahis1.com/en/live/Football/"
    r = requests.get(onexurl)
    soup = BeautifulSoup(r.content, "html.parser")
    income = soup.find_all("ul", {"id":"games_content"})
    links = soup.find_all("a", {"class": "c-events__name"})

    urls = []
    for matchlink in links:
        urls.append("https://1xbahis1.com/en/"+(matchlink.get("href")))

    # only try 3 urls for testing
    urls = urls[:3]

    app = QtWidgets.QApplication(sys.argv)
    webpage = WebPage()
    webpage.start(urls)
    sys.exit(app.exec_())
ekhumoro
  • What are the URLs in `linklist`? They are needed to be able to test your question. – eyllanesc Nov 19 '17 at 21:05
  • You cannot create more than one `QApplication`. See [this answer](https://stackoverflow.com/a/21294180/984421) for how to scrape multiple urls. It uses pyqt4/webkit, but you can easily port it to pyqt5. – ekhumoro Nov 19 '17 at 21:05
  • Thank you very much @ekhumoro, I'm trying to adapt the code. I will come back with the results. – Ahmet Uluer Nov 19 '17 at 22:04
  • @AhmetUluer. It should be easy. Replace the part where it says `# generate some test urls` with your own code for creating a list of urls. Then put your beautifulsoup code inside `processCurrentPage`, where it says `# do stuff with html`. – ekhumoro Nov 19 '17 at 22:17
  • @ekhumoro Yes, I've tried exactly what you said, but Python crashed. Is it too heavy to run? – Ahmet Uluer Nov 19 '17 at 22:28
  • @ekhumoro Additionally, as far as I have tested, it's related to `self.load(QtCore.QUrl(url))`. – Ahmet Uluer Nov 19 '17 at 22:42
  • PS: I just re-tested the script, and it downloads 12 html pages in about 3 seconds and then terminates without any errors of any kind. This is using python 3.6.3, qt 5.9.2, and pyqt 5.9.1 on linux, and running it in a normal console. – ekhumoro Nov 19 '17 at 22:51
  • @ekhumoro Yes, your code is working perfectly, but I think my code has problems. Thank you very much for your help. – Ahmet Uluer Nov 19 '17 at 23:05
  • @AhmetUluer. If you [edit your question](https://stackoverflow.com/posts/47381962/edit) and add your version of my script, I will try to explain how to fix it. – ekhumoro Nov 19 '17 at 23:35
  • Hi @ekhumoro, I've added my whole code as an edit. I would be very happy if you could help. Thanks in advance. – Ahmet Uluer Nov 20 '17 at 19:28
  • @AhmetUluer. I have edited your question and fixed your code. However, it was quite slow when I ran it - probably because the site loads lots of scripts, images, adverts, etc. It takes about 15-20 seconds before any output is shown. For testing, I have added a line so that only the first three urls are processed. If you want to see what I changed, [look here](https://stackoverflow.com/posts/47381962/revisions#). – ekhumoro Nov 20 '17 at 21:10
  • I really don't know what to say, you are so helpful. Thank you very much again for the code and the explanation. – Ahmet Uluer Nov 20 '17 at 21:22
  • Hi @ekhumoro, I don't really want to disturb you again, but I've been looking into this for two whole days and read the whole class documentation, and I couldn't figure it out. Can you help me get the local output of `veri = veri.splitlines()` out of the class and make it global? Or you can tell me what to research; either is fine. I want to use the `veri` list globally for each web page parsed. – Ahmet Uluer Nov 22 '17 at 16:00
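
On the last request (using `veri` outside the class): one common approach is to store each page's lines on the object instead of only printing them, and to read that attribute once processing has finished. The following is a minimal, Qt-free sketch of that idea, not the thread's confirmed solution. `PageQueue`, `fake_loader`, `results` and `veri_by_url` are hypothetical names, and the loader is a plain function standing in for the asynchronous `load`/`toHtml` round trip.

```python
class PageQueue:
    """Qt-free stand-in for WebPage: same flow, results kept on the object."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable: url -> page text
        self.results = {}       # url -> list of lines

    def start(self, urls):
        for url in urls:
            self.processCurrentPage(url, self._loader(url))

    def processCurrentPage(self, url, text):
        # Store instead of print, so the caller can use the data later.
        self.results[url] = text.splitlines()


def fake_loader(url):
    # Placeholder content; a real loader would return the rendered page text.
    return "line one\nline two"


queue = PageQueue(fake_loader)
queue.start(["https://example.com/a", "https://example.com/b"])

# The collected lines are now available outside the class.
veri_by_url = queue.results
print(veri_by_url["https://example.com/a"])
```

In the `WebPage` class from the question, the equivalent would presumably be to create `self.results = {}` in `__init__`, fill it in `processCurrentPage`, and call `app.exec_()` on its own line rather than inside `sys.exit(...)`, so that `webpage.results` can still be read after the event loop returns.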
