I am following the tutorial from here, called

Introduction to Web Scraping (Python) - Lesson 04 (Download Images)

Below is the code that I run on Ubuntu 16.04:

import urllib
from urllib2 import urlopen, build_opener
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urlopen(url)

    opener = build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    response = opener.open('https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1')
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("https://www.imdb.com/search/name?gender=male,female&ref_=nv_tp_cel_1")

i=1

for img in soup.findAll('img'):
    print(img.get('src'))

    filename=str(i)
    i=i+1

    #urllib.urlretrieve(img.get('src'),filename)
    imagefile = open(filename + ".jpeg", 'wb')
    theLink = urllib.urlopen(img.get('src'))
    imagefile.write(theLink.read())
    imagefile.close()

It looks like it downloads all the images but when I try to open any of them I get:

Could not load image '1.jpeg'. Error interpreting JPEG image file (Not a JPEG file: starts with 0x3c 0x21)

If I run less 1.jpeg I get:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Bad request.

<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: 9aEqiCgrzrSAsiL9Q8uvHlgu4SAaDxdBNclFG3AJjxtKn1R7RA35-A==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
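
The two bytes in the error message, 0x3c 0x21, are the ASCII characters `<!`, which is how the `<!DOCTYPE ...>` error page above begins, so the saved file is HTML and not image data. A minimal Python 3 sketch of that check follows; `looks_like_jpeg` and `JPEG_MAGIC` are hypothetical names introduced only for illustration, and the sample bytes are a shortened copy of the first line of the error page.

```python
# JPEG files begin with the start-of-image marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG start-of-image marker."""
    return data[:3] == JPEG_MAGIC


# Shortened first line of the CloudFront error page saved as 1.jpeg.
html_error = b'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">'

# The first two bytes are the ones the image viewer complained about.
print(hex(html_error[0]), hex(html_error[1]))
print(looks_like_jpeg(html_error))
```

Running a check like this on the downloaded bytes before writing them to disk would flag the error pages instead of saving them as `.jpeg` files.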

My goal is to download all the pictures from the website. I tried other websites as well, but with no success.

Cezar Cobuz
  • Maybe see also https://datascience.stackexchange.com/questions/5534/how-to-scrape-imdb-webpage – tripleee Jun 04 '18 at 09:20
  • Maybe see also https://www.imdb.com/conditions – tripleee Jun 04 '18 at 09:21
  • The website you are scraping probably has special programming to prevent that kind of scraping of images. That's why what you actually save here is an error page for each image instead of the image itself. It's very possible that the website knows by the request headers that the request is coming from a script and not from a browser. Try changing the headers, read this: https://stackoverflow.com/questions/802134/changing-user-agent-on-urllib2-urlopen – Ofer Sadan Jun 04 '18 at 09:21
  • Possible duplicate of [IMDB Poster URL Returns Referral Denied](https://stackoverflow.com/questions/11044010/imdb-poster-url-returns-referral-denied) – tripleee Jun 04 '18 at 09:22
  • You can often overcome a website blocking web scraping by just making sure the User-Agent header resembles a real browser. Not sure if that will work for IMDB, but it's something I've had to do for other sites. – selbie Jun 04 '18 at 09:27
  • @OferSadan I read what you recommended and I will edit the question, but unfortunately I encounter the same problem – Cezar Cobuz Jun 05 '18 at 09:50
  • @Elliad see also the link tripleee sent, they state that it's illegal to do that anyway – Ofer Sadan Jun 05 '18 at 09:57
  • I tried other websites as well, with the same problem, I was just curious if it works on any website, not IMDB specifically. Is there any website that might permit this kind of scraping? – Cezar Cobuz Jun 05 '18 at 10:01
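
The header change suggested in the comments only takes effect if the request actually goes through the opener that carries it. In the question's code the soup is built from `thepage`, which comes from a plain `urlopen(url)`, and each image is fetched with `urllib.urlopen`, so neither request sends the custom User-Agent. Below is a minimal Python 3 sketch of an opener that can be reused for every request; `make_opener` is a hypothetical helper name, and whether a browser-like User-Agent is enough for any given site is an assumption, not something confirmed here.

```python
from urllib.request import build_opener


def make_opener(user_agent="Mozilla/5.0"):
    """Build an opener whose requests all carry the given User-Agent."""
    opener = build_opener()
    # Replaces the default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener()

# Usage (network access required, so left as comments):
# page = opener.open(url)            # fetch the HTML page
# image = opener.open(image_url)     # fetch each image with the same header
```

The response returned by `opener.open(...)` is the object to pass to `BeautifulSoup` or to read image bytes from.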

1 Answer

The code below might help you:

import requests
import urllib.request
from bs4 import BeautifulSoup

# Make HTTP request
url = "https://www.imdb.com/search/name/?gender=male,female&ref_=nv_tp_cel_1"
response = requests.get(url)
print(response.status_code)

# Parse HTML
soup = BeautifulSoup(response.content, 'html.parser')
response.close()

lister_list = soup.find('div',{"class":"lister-list"})
lister_items = lister_list.find_all("div",{"class":"lister-item"})

for i in lister_items:
    image = {}

    # Find image info inside each item
    image['item'] = i.find("div",{"class":"lister-item-image"}).find("img")
    image['alt'] = image['item']['alt']
    image['src'] = image['item']['src']

    # Save image
    urllib.request.urlretrieve(str(image['src']), f"{image['alt']}.jpg")
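
Two things can still go wrong when saving: the `alt` text may contain characters that are not valid in a filename, and the server may answer with an error page instead of an image, as happened in the question. The following is a hedged sketch using only the standard library; `safe_filename`, `download_image`, and the `HEADERS` value are assumptions introduced for illustration and are not part of the original answer.

```python
import re
import urllib.error
import urllib.request

# Assumed header value, the same one used in the question.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def safe_filename(alt, extension=".jpg"):
    """Turn an image's alt text into a filename without path separators."""
    cleaned = re.sub(r"[^\w\- ]", "_", alt).strip()
    return (cleaned or "image") + extension


def download_image(src, path, headers=HEADERS):
    """Fetch src with explicit headers; save it only if an image comes back."""
    request = urllib.request.Request(src, headers=headers)
    try:
        with urllib.request.urlopen(request) as response:
            # Skip HTML error pages that arrive with a 200 status.
            if not response.headers.get_content_type().startswith("image/"):
                return False
            data = response.read()
    except urllib.error.HTTPError:
        # Raised for responses such as the 403 shown in the question.
        return False
    with open(path, "wb") as image_file:
        image_file.write(data)
    return True
```

Inside the loop, the `urlretrieve` call could then be swapped for `download_image(str(image['src']), safe_filename(image['alt']))`, which returns `False` instead of writing an error page to disk.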