Scrape dynamically loaded website with Python

Question

I'm new to scraping dynamically loaded websites and I'm stuck trying to scrape the team names and odds from this website:

https://www.cashpoint.com/de/fussball/deutschland/bundesliga

I tried it with PyQt5 like in this post

PyQt4 to PyQt5 -> mainFrame() deprecated, need fix to load web pages

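import sys

import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication


class Page(QWebEnginePage):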

    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML arrives in Callable
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()

def main():

    page = Page('https://www.cashpoint.com/de/fussball/deutschland/bundesliga')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    js_test = soup.find('div', class_='game__team game__team__football')
    print(js_test.text)

if __name__ == '__main__': main()

But it did not work for the website I want to scrape. I'm getting an AttributeError: 'NoneType' object has no attribute 'text'. I'm not getting the content of the site with this method, although the post above describes a method for dynamically loaded websites.

As I have read, the first approach when dealing with dynamically loaded websites is to identify how the data is rendered on the page. How do I do that, and why isn't PyQt5 working for this website? Selenium isn't an option for me since it would be too slow to get live odds. Can I get the HTML content of the site as it is shown when I inspect the site, so that I can then use it the normal way with BeautifulSoup or Scrapy? Thank you in advance.

Why is Selenium too slow? It should be able to load a page on the order of seconds and it would be the simplest solution here. – Joseph Rajchwald, Dec 12 '19 at 21:53
As somebody mentioned on this forum, Selenium uses a lot of resources, and when you scrape 50 sites in parallel with threads it's going to be slowed down, isn't it? – rickyspanish, Dec 13 '19 at 13:44

score 1 · Answer 1 · answered Dec 12 '19 at 23:25

The code you provided fails because, even after the page has finished loading, new elements are still being created asynchronously, including the divs you want to get ("game__team" and "game__team__football"). At the time the loadFinished signal is emitted, those elements do not exist yet.

One possible solution is to use JavaScript directly to get the list of texts via the runJavaScript() method and, if the list is empty, to try again after an interval T until the list is not empty.

import sys

from PyQt5 import QtCore, QtWidgets, QtWebEngineWidgets


class Scrapper(QtCore.QObject):
    def __init__(self, interval=500, parent=None):
        super().__init__(parent)

        self._result = []
        self._interval = interval

        self.page = QtWebEngineWidgets.QWebEnginePage(self)
        self.page.loadFinished.connect(self.on_load_finished)
        self.page.load(
            QtCore.QUrl("https://www.cashpoint.com/de/fussball/deutschland/bundesliga")
        )

    @property
    def result(self):
        return self._result

    @property
    def interval(self):
        return self._interval

    @interval.setter
    def interval(self, interval):
        self._interval = interval

    @QtCore.pyqtSlot(bool)
    def on_load_finished(self, ok):
        if ok:
            self.execute_javascript()
        else:
            QtCore.QCoreApplication.exit(-1)

    def execute_javascript(self):
        self.page.runJavaScript(
            """
        function text_by_classname(classname){ 
            var texts = [];
            var elements = document.getElementsByClassName(classname);
            for (const e of elements) {
                texts.push(e.textContent);
            }
            return texts;
        }
        [].concat(text_by_classname("game__team"), text_by_classname("game__team__football"));
        """,
            self.javascript_callback,
        )

    def javascript_callback(self, result):
        if result:
            self._result = result
            QtCore.QCoreApplication.quit()
        else:
            QtCore.QTimer.singleShot(self.interval, self.execute_javascript)


def main():
    app = QtWidgets.QApplication(sys.argv)
    scrapper = Scrapper(interval=1000)
    app.exec_()
    result = scrapper.result
    del scrapper, app

    print(result)


if __name__ == "__main__":
    main()

Output:

[' 1899 Hoffenheim ', ' FC Augsburg ', ' Bayern München ', ' Werder Bremen ', ' Hertha BSC ', ' SC Freiburg ', ' 1. Fsv Mainz 05 ', ' Borussia Dortmund ', ' 1. FC Köln ', ' Bayer 04 Leverkusen ', ' SC Paderborn ', ' FC Union Berlin ', ' Fortuna Düsseldorf ', ' RB Leipzig ', ' VFL Wolfsburg ', ' Borussia Mönchengladbach ', ' FC Schalke 04 ', ' Eintracht Frankfurt ', ' Werder Bremen ', ' 1. Fsv Mainz 05 ', ' Borussia Dortmund ', ' RB Leipzig ', ' FC Augsburg ', ' Fortuna Düsseldorf ', ' FC Union Berlin ', ' 1899 Hoffenheim ', ' Bayer 04 Leverkusen ', ' Hertha BSC ', ' Borussia Mönchengladbach ', ' SC Paderborn ', ' VFL Wolfsburg ', ' FC Schalke 04 ', ' Eintracht Frankfurt ', ' 1. FC Köln ', ' SC Freiburg ', ' Bayern München ', ' 1899 Hoffenheim ', ' Borussia Dortmund ', ' Bayern München ', ' VFL Wolfsburg ', ' 1899 Hoffenheim ', ' Bayern München ', ' Hertha BSC ', ' 1. Fsv Mainz 05 ', ' 1. FC Köln ', ' SC Paderborn ', ' Fortuna Düsseldorf ', ' VFL Wolfsburg ', ' FC Schalke 04 ', ' Werder Bremen ', ' Borussia Dortmund ', ' FC Augsburg ', ' FC Union Berlin ', ' Bayer 04 Leverkusen ', ' Borussia Mönchengladbach ', ' VFL Wolfsburg ', ' Eintracht Frankfurt ', ' SC Freiburg ', ' 1899 Hoffenheim ', ' Bayern München ']
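The strings in the result are padded with spaces, so they usually need a little cleanup before use. The following is only a minimal sketch of that step. The helper names `clean_texts` and `pair_teams` are hypothetical, and the idea that two consecutive entries belong to the same match is an assumption about the page layout, not something the output confirms.

```python
def clean_texts(texts):
    """Strip surrounding whitespace and drop entries that are empty."""
    return [t.strip() for t in texts if t and t.strip()]


def pair_teams(names):
    """Group names two at a time as (first, second) tuples.

    Assumption: the page lists the two teams of a match consecutively.
    A trailing unpaired name is ignored because zip() stops at the
    shorter of the two slices.
    """
    return list(zip(names[0::2], names[1::2]))


# A few entries copied from the output above
raw = [' 1899 Hoffenheim ', ' FC Augsburg ', ' Bayern München ', ' Werder Bremen ']
matches = pair_teams(clean_texts(raw))
print(matches)
# [('1899 Hoffenheim', 'FC Augsburg'), ('Bayern München', 'Werder Bremen')]
```

If the pairing assumption does not hold for the real page, `clean_texts` can still be used on its own.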

Thank you for your answer. I'm still not sure how the classname in text_by_classname gets passed from [].concat(text_by_classname("game__team"), text_by_classname("game__team__football")), and how document.getElementsByClassName(classname); loads the content I need. I searched for documentation of runJavaScript but couldn't find any explanation. Maybe you can tell me the source where you learned it from? – rickyspanish, Dec 14 '19 at 13:53
@rickyspanish That is JavaScript code; it has nothing to do with Qt/PyQt. See https://developer.mozilla.org/en-US/docs/Web/API/Document/getElementsByClassName, https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/concat. runJavaScript is a QWebEnginePage method that allows you to run JavaScript and receive the result in a callback. If you are going to get into the world of scraping, I recommend you learn HTML and JavaScript and how they work. – eyllanesc, Dec 14 '19 at 17:45
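To make the argument passing in that snippet more concrete, here is a rough Python stand-in for the JavaScript, using only the standard library. It is a sketch, not the code that runs inside the page: `ClassTextCollector` is a hypothetical helper, the sample HTML is invented, and the parser assumes well-formed markup without void tags such as `<br>` inside matching elements. Like `getElementsByClassName`, it matches an element when the requested class is one of the element's classes.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element whose class list contains `classname`."""

    def __init__(self, classname):
        super().__init__()
        self.classname = classname
        self.texts = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1       # nested tag inside a match
        elif self.classname in classes:
            self._depth = 1
            self.texts.append("")  # start a new text entry

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.texts[-1] += data


def text_by_classname(html, classname):
    """Python counterpart of the JavaScript function with the same name."""
    parser = ClassTextCollector(classname)
    parser.feed(html)
    parser.close()
    return parser.texts


# Invented sample markup, only for illustration
html = (
    '<div class="game__team game__team__football"> FC Augsburg </div>'
    '<div class="game__team"> RB Leipzig </div>'
)

# Each call binds its own `classname`; the two lists are then joined,
# which is what [].concat(...) does in the JavaScript version.
result = text_by_classname(html, "game__team") + text_by_classname(html, "game__team__football")
print(result)
# [' FC Augsburg ', ' RB Leipzig ', ' FC Augsburg ']
```

Note that an element carrying both classes is returned by both calls, so it appears twice in the joined list. That may be one reason for the repeated names in the output above.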

score 0 · Answer 2 · answered Dec 13 '19 at 20:02


My suggestion to you is to use Selenium as a solution:

pip install selenium

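from selenium import webdriver
from bs4 import BeautifulSoup as soup

# Page to scrape (the URL from the question)
URL = 'https://www.cashpoint.com/de/fussball/deutschland/bundesliga'

# Path to the geckodriver binary; adjust it for your machine.
# Newer Selenium 4 releases take service=Service(path) instead of executable_path.
driver = webdriver.Firefox(executable_path = '/Users/alireza/Downloads/geckodriver')
driver.get(URL)
driver.maximize_window()
page_source = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
page_soup = soup(page_source, 'html.parser')

# find() returns None if the div is not in the captured HTML
js_test = page_soup.find("div", {"class":"game__team game__team__football"})
print(js_test.text if js_test else 'element not found')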

You can download geckodriver from here.

If you want to see example code, you can check here. It's a web scraper for www.tripadvisor.com. Hope this helps.
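Because the page builds its team divs asynchronously, HTML captured immediately after the page loads may not contain them yet. One way to handle that, sketched below under stated assumptions, is to retry a lookup until it returns something or a deadline passes. `poll_until` and `fake_fetch` are hypothetical names used only for illustration, and the stand-in lookup replaces a real browser call so the example runs on its own.

```python
import time


def poll_until(fetch, interval=0.5, timeout=10.0):
    """Call fetch() repeatedly until it returns a non-empty result.

    Raises TimeoutError if nothing is returned within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("no result within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for a real lookup: returns nothing on the first two calls.
attempts = {"count": 0}


def fake_fetch():
    attempts["count"] += 1
    return ["FC Augsburg"] if attempts["count"] >= 3 else []


teams = poll_until(fake_fetch, interval=0.01, timeout=1.0)
print(teams, attempts["count"])
# ['FC Augsburg'] 3
```

With Selenium, `fetch` could be something like `lambda: driver.find_elements(By.CSS_SELECTOR, "div.game__team")`, with `By` imported from `selenium.webdriver.common.by`. Selenium also ships its own `WebDriverWait` helper in `selenium.webdriver.support.ui`, which serves the same purpose.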

answered Dec 13 '19 at 20:02

Alireza Nazari


Thank you for your answer. I will try it, but was I misinformed that Selenium uses a lot of CPU resources and is slower than other methods? – rickyspanish Dec 14 '19 at 13:55
@rickyspanish I didn't find any better library for this task, and although I used it a lot for my research, I did not face any extreme CPU usage. – Alireza Nazari Dec 14 '19 at 15:18
