I have the following Python code running in my Jupyter notebook:

from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')

movies[0].text_content()

The above code gives me the following output:

'The Shawshank Redemption'

This is the text of the first row in the table column whose cells have the class 'titleColumn'. The same table has another column, 'posterColumn', which contains a thumbnail image.

Now I want my code to retrieve those images as well, and to show each image in the output.

Do I need to use another package to achieve this? Can the image be shown in Jupyter Notebook?

user3115933
  • There's a very similar question [using BeautifulSoup](http://stackoverflow.com/questions/18304532/extracting-image-src-based-on-attribute-with-beautifulsoup). – adabsurdum Mar 28 '17 at 09:38
  • Thanks. I missed that one. I'll have a look and see where it goes from there. – user3115933 Mar 28 '17 at 09:41

1 Answer

To get the associated images, you need to select the posterColumn cells. From each one you can extract the img src attribute and download the JPEG it points to. The file can then be saved under the movie title, taking care to remove any characters that are not valid in a filename, such as ':':

from lxml.html import parse
import requests
import string

valid_chars = "-_.() " + string.ascii_letters + string.digits
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

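# Posters and titles appear in the same row order, so pair them up.
for p, m in zip(posters, movies):
    # iterlinks() yields (element, attribute, link, pos) for every link in the cell.
    for element, attribute, link, pos in p.iterlinks():
        # The <img> tag's src attribute holds the thumbnail URL.
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            # Keep only characters that are safe in a filename.
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)

            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                # Iterating over the response streams the body in chunks.
                for chunk in poster_jpg:
                    f_jpg.write(chunk)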
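
The filename clean-up step can also be pulled out into a small helper, which makes it easier to check on its own. This is only a sketch: safe_filename and its default argument are hypothetical names, not part of lxml or requests, and the whitelist is the same one used above. Because the whitelist is ASCII-only, accented letters are dropped as well, so two different titles could end up with the same filename.

```python
import string

# Same whitelist as above: any character not listed here is dropped.
VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def safe_filename(title, default="untitled"):
    """Return the title with every non-whitelisted character removed."""
    cleaned = "".join(c for c in title if c in VALID_CHARS).strip()
    # Fall back to a placeholder if nothing survives the filter.
    return cleaned or default


print(safe_filename("The Godfather: Part II"))  # The Godfather Part II
```

In the loop, valid_filename = safe_filename(m.text_content()) would then replace the inline join.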

At the time of writing, the download loop prints output starting with:

The Shawshank Redemption                           https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather                                      https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II                             https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
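
As for showing the images in the notebook itself, here is a minimal sketch, assuming the code runs inside Jupyter (where IPython is available) on Python 3. poster_img_tag and show_poster are hypothetical helper names introduced for illustration; IPython.display.HTML and IPython.display.display are the IPython calls that do the rendering. The sketch displays the poster straight from its URL rather than from the saved file.

```python
from html import escape


def poster_img_tag(title, url, width=45):
    """Build an <img> tag for one poster (string building only, no network)."""
    return '<img src="{}" alt="{}" width="{}">'.format(
        escape(url, quote=True), escape(title, quote=True), width
    )


def show_poster(title, url):
    """Render the poster inline; this only works inside IPython/Jupyter."""
    # Imported here so the helper above stays usable outside a notebook.
    from IPython.display import HTML, display

    display(HTML(poster_img_tag(title, url)))
```

Calling show_poster(m.text_content(), link) inside the loop, next to the print call, should display each thumbnail as it is found.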
Martin Evans