PyQt5 to scrape IMDb webpage

Question

I have just started learning web scraping with Python, and I want to scrape the image from this link. This is a screenshot of the browser's "Inspect" view. Because the page relies on JavaScript, this is the code I tried:

import bs4 as bs
import sys
import urllib.request
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl

class Page(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _on_load_finished(self):
        # toHtml() is asynchronous and returns None; the HTML arrives in Callable
        self.toHtml(self.Callable)
        print('Load finished')

    def Callable(self, html_str):
        self.html = html_str
        self.app.quit()


def main():
    page = Page('https://www.imdb.com/name/nm0005683/mediaviewer/rm2073384192')
    soup = bs.BeautifulSoup(page.html, 'html.parser')
    imagetag = soup.find('div', id='photo-container')
    print (imagetag)

if __name__ == '__main__': main()

This code is actually from here; I only changed the link.

This is the error I'm getting:

js: Uncaught TypeError: Cannot read property 'x' of undefined
Load finished
<div id="photo-container"></div>

I do not know what the error actually is. The contents of `<div id="photo-container">` aren't showing up. I tried googling the error but couldn't find anything that helps in this situation. If I should use a different method to scrape the image, I'm open to suggestions.

PS: I'm also new to Stack Overflow, so if anything here is against the rules, I can edit the question as required.

I'd say the TypeError is a red herring. The problem with scraping pages is pages often have problems that are mostly benign. Your code did what you asked it to do. The TypeError is probably normal output you'd see if you looked at the console output in a browser. — shao.lo, Apr 18 '18 at 05:45
@shao.lo If it has done what I asked, then how can I display the result? `print (imagetag)` isn't showing the complete contents of `<div id="photo-container">` because of that error — Ullas Pv, Apr 18 '18 at 13:29
BeautifulSoup is going to process the raw html. What gets rendered on the page is often filled in with javascript dynamically. If you look at the page source you'll see that is the case. To get the actual contents, you'll need to do that in the page via javascript. — shao.lo, Apr 18 '18 at 14:18
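
To illustrate the point made in the last comment, here is a minimal sketch that uses only Python's standard-library `html.parser` in place of BeautifulSoup. The `RAW_HTML` snippet, the `ImgCollector` class and the example.com URL are made up for illustration and are not IMDb's actual markup. The sketch only shows that a parser working on raw HTML never runs the script that would fill the container.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw HTML a server might send: the container
# is empty and a script is expected to fill it in later, inside a browser.
RAW_HTML = """
<html><body>
  <div id="photo-container"></div>
  <script>
    var img = document.createElement('img');
    img.src = 'https://example.com/photo.jpg';
    document.getElementById('photo-container').appendChild(img);
  </script>
</body></html>
"""


class ImgCollector(HTMLParser):
    """Collect the src of every <img> start tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.sources.append(dict(attrs).get('src'))


collector = ImgCollector()
collector.feed(RAW_HTML)
print(collector.sources)  # prints [] because the script is never executed
```

The parser treats the script body as plain text, so no `<img>` element ever exists from its point of view. That matches the empty `<div id="photo-container"></div>` printed in the question.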

score 0 · Accepted Answer · answered Apr 18 '18 at 16:02

You will probably want to use a webchannel to do the actual work, but the following shows you how to access the images you are looking for. I'll leave the webchannel research for you.

import sys
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl, QTimer

class Page(QWebEnginePage):
    def __init__(self, parent):
        QWebEnginePage.__init__(self, parent)
        self.html = ''
        self.loadFinished.connect(self._on_load_finished)

    def _on_load_finished(self):
        print('Load finished')
        QTimer.singleShot(1000, self._after_loading)  # load finished does not mean rendered..may need to wait here
        QTimer.singleShot(5000, self._exit)

    def _after_loading(self):
        print('_after_loading')
        js = '''console.log('javascript...');
        var images = document.querySelectorAll('#photo-container img');
        console.log('images ' + images);
        console.log('images ' + images.length);
        for (var i = 0; i < images.length; i++)
        {
            var image = images[i];
            console.log(image.src);
        }        
        var element = document.querySelector('body');
        //console.log(element.innerHTML);  // If you uncomment this you'll see that the photo-container is still empty
        '''
        self.runJavaScript(js)
        print('_after_loading...done')

    def _exit(self):
        print('_exit')
        QApplication.instance().quit()

    def javaScriptConsoleMessage(self, level: QWebEnginePage.JavaScriptConsoleMessageLevel, message: str, lineNumber: int, sourceID: str):
        print(message)

def main():
    app = QApplication(sys.argv)
    w = QWebEngineView()
    w.setPage(Page(w))
    w.load(QUrl('https://www.imdb.com/name/nm0005683/mediaviewer/rm2073384192'))
    w.show()
    app.exec_()

if __name__ == '__main__': main()
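
The code above only prints the image URLs through the JavaScript console. As a possible alternative to the web channel mentioned in the answer (not something the answer itself recommends), `QWebEnginePage.runJavaScript` also accepts a callback that receives the value of the script. The sketch below assumes that approach. The names `COLLECT_SRC_JS`, `clean_sources`, `fetch_image_sources` and `delay_ms` are hypothetical, and whether any images are found still depends on the page having rendered by the time the script runs.

```python
import sys

# JavaScript whose final expression is handed back to Python as a list of
# strings. The selector is the one used in the answer above.
COLLECT_SRC_JS = """
(function () {
    var images = document.querySelectorAll('#photo-container img');
    return Array.prototype.map.call(images, function (img) { return img.src; });
})();
"""


def clean_sources(result):
    """Normalise the callback value: keep non-empty strings, drop the rest."""
    return [src for src in (result or []) if isinstance(src, str) and src]


def fetch_image_sources(url, delay_ms=1000):
    """Load url in a QWebEnginePage and return the <img> sources found."""
    # Imported here so the helpers above can be used without Qt installed.
    from PyQt5.QtCore import QTimer, QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    app = QApplication.instance() or QApplication(sys.argv)
    page = QWebEnginePage()
    found = []

    def on_result(result):
        # Called asynchronously with the value of the script's last expression.
        found.extend(clean_sources(result))
        app.quit()

    def on_load_finished(ok):
        # loadFinished does not mean the scripts have rendered; wait a little.
        QTimer.singleShot(
            delay_ms, lambda: page.runJavaScript(COLLECT_SRC_JS, on_result))

    page.loadFinished.connect(on_load_finished)
    page.load(QUrl(url))
    app.exec_()  # blocks until on_result calls app.quit()
    return found
```

Calling `fetch_image_sources('https://www.imdb.com/name/nm0005683/mediaviewer/rm2073384192')` would then return a Python list instead of console output. If the list comes back empty, a longer `delay_ms` may be needed.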
