To get the associated images, you need to select the posterColumn cells. From these you can extract the img src entry and download the jpg images. Each file can then be saved under the movie title, taking care to remove any characters that are not valid in filenames, such as the colon:
from lxml.html import parse
import requests
import string

# Characters allowed in the saved filenames
valid_chars = "-_.() " + string.ascii_letters + string.digits

tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[@class="chart full-width"]//td[@class="titleColumn"]//a')
posters = tree.findall('.//table[@class="chart full-width"]//td[@class="posterColumn"]//a')

for p, m in zip(posters, movies):
    # Each poster link wraps an <img>; its src attribute is the image URL
    for element, attribute, link, pos in p.iterlinks():
        if attribute == 'src':
            print("{:50} {}".format(m.text_content(), link))
            poster_jpg = requests.get(link, stream=True)
            # Drop characters such as ':' that are not valid in filenames
            valid_filename = ''.join(c for c in m.text_content() if c in valid_chars)

            with open('{}.jpg'.format(valid_filename), 'wb') as f_jpg:
                for chunk in poster_jpg:
                    f_jpg.write(chunk)
Currently, the output starts with:
The Shawshank Redemption                           https://images-na.ssl-images-amazon.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_UY67_CR0,0,45,67_AL_.jpg
The Godfather                                      https://images-na.ssl-images-amazon.com/images/M/MV5BZTRmNjQ1ZDYtNDgzMy00OGE0LWE4N2YtNTkzNWQ5ZDhlNGJmL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
The Godfather: Part II                             https://images-na.ssl-images-amazon.com/images/M/MV5BMjZiNzIxNTQtNDc5Zi00YWY1LThkMTctMDgzYjY4YjI1YmQyL2ltYWdlL2ltYWdlXkEyXkFqcGdeQXVyNjU0OTQ0OTY@._V1_UY67_CR1,0,45,67_AL_.jpg
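
The filename clean-up used in the loop can be checked on its own, without any network access. A minimal sketch, assuming the same whitelist as above; sanitize_filename is a hypothetical helper name introduced here for illustration and is not part of the original code:

```python
import string

# Same whitelist as in the answer: any character outside it is dropped.
VALID_CHARS = "-_.() " + string.ascii_letters + string.digits


def sanitize_filename(title):
    """Return title with every character not in VALID_CHARS removed.

    Hypothetical helper, named here only for illustration.
    """
    return ''.join(c for c in title if c in VALID_CHARS)


print(sanitize_filename("The Godfather: Part II") + ".jpg")
# The Godfather Part II.jpg
```

Note that the whitelist only contains ASCII letters, so accented characters in a title are dropped as well (for example, "Léon" becomes "Lon"). If that matters, the whitelist would need to be extended.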