I am following a tutorial called
"Introduction to Web Scraping (Python) - Lesson 04 (Download Images)".
Below is the code that I run on Ubuntu 16.04:
import urllib
from urllib2 import urlopen, build_opener
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urlopen(url)
    opener = build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open('https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1')
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1")

i = 1
for img in soup.findAll('img'):
    print(img.get('src'))
    filename = str(i)
    i = i + 1
    #urllib.urlretrieve(img.get('src'),filename)
    imagefile = open(filename + ".jpeg", 'wb')
    theLink = urllib.urlopen(img.get('src'))
    imagefile.write(theLink.read())
    imagefile.close()
It looks like it downloads all the images, but when I try to open any of them I get:
Could not load image '1.jpeg'. Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)
If I run less 1.jpeg, I get:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Bad request.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: 9aEqiCgrzrSAsiL9Q8uvHlgu4SAaDxdBNclFG3AJjxtKn1R7RA35-A==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
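The bytes 0x3c 0x21 from the error message are the ASCII characters "<!", which matches the "<!DOCTYPE" at the top of the output above, whereas a JPEG file normally starts with 0xff 0xd8. So the saved file appears to be the HTML error page rather than an image. A minimal sketch of a check that could be run on the downloaded bytes before writing them to disk (the helper names looks_like_jpeg and looks_like_html are hypothetical, not part of the tutorial, and the HTML check is only a rough heuristic):

```python
def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG start-of-image marker."""
    # A JPEG stream begins with the two bytes 0xFF 0xD8.
    return data[:2] == b"\xff\xd8"


def looks_like_html(data):
    """Return True if the bytes start like an HTML document ("<!" or "<h")."""
    # 0x3c 0x21 is "<!", as in "<!DOCTYPE"; leading whitespace is ignored.
    head = data.lstrip()[:2].lower()
    return head in (b"<!", b"<h")


if __name__ == "__main__":
    error_page = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'
    print(looks_like_jpeg(error_page))  # False
    print(looks_like_html(error_page))  # True
```

In the loop above, the result of theLink.read() could be passed to looks_like_jpeg before it is written to the file.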
My goal is to download all the pictures from the website. I tried other websites as well, but with no success.
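One thing that might be worth checking is whether the image requests carry the same User-Agent header that make_soup sets on its opener, since urllib.urlopen in the loop sends none. The sketch below assumes, without confirmation, that the 403 is related to that missing header, and that some src values may be relative URLs. The names build_image_request, save_image and PAGE_URL are hypothetical, and the imports fall back to the Python 3 equivalents of urllib2 so the same code runs on either version.

```python
try:
    # Python 2, as used in the code above
    from urllib2 import Request, build_opener
    from urlparse import urljoin
except ImportError:
    # Python 3 equivalents of the same names
    from urllib.request import Request, build_opener
    from urllib.parse import urljoin

# The page being scraped; used to resolve relative image URLs.
PAGE_URL = "https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1"


def build_image_request(src, page_url=PAGE_URL):
    """Build a request for an <img> src that carries a User-Agent header."""
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    # against the page URL.
    absolute_url = urljoin(page_url, src)
    return Request(absolute_url, headers={"User-Agent": "Mozilla/5.0"})


def save_image(request, filename):
    """Fetch the request and write the response body to filename.

    This performs network access and raises on HTTP errors such as 403.
    """
    response = build_opener().open(request)
    try:
        data = response.read()
    finally:
        response.close()
    with open(filename, "wb") as imagefile:
        imagefile.write(data)
```

Inside the loop this would be called as save_image(build_image_request(img.get('src')), filename + ".jpeg"), assuming every img tag has a src attribute.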