
I am trying to retrieve the images from this website (with permission). Here is my code, with the website I want to access:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.vgmuseum.com/nes.htm"

page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html5lib")
li = soup.select('ol > li > a')
for link in li:
    print(link.get('href'))

The images I would like to use are in this ordered list here: list location for images

  • You'd first have to scrape the left frame (`http://www.vgmuseum.com/nes_b.html`) to get a list of the URLs that correspond to each set of images you want, then you'd need to go to those URLs and scrape the images, so it's a multi-step process. I'm not going to write it for you since you have a good start, but that's how you'd approach it. – Dan Dec 11 '17 at 22:54
  • I'm not asking anyone to write code for me. I'm just asking what needs to be improved. This is my first time using this library. Thanks for your answer – unmatchedsock Dec 11 '17 at 23:01

2 Answers


The page you are working with consists of iframes, which are basically a way of including one page inside another. Browsers understand how iframes work: they download the framed pages and display them in the browser window.

urllib2, though, is not a browser and cannot do that. You need to find which iframe the list of links lives in and then follow the URL that the iframe's content comes from. In your case, the list of links on the left comes from the http://www.vgmuseum.com/nes_b.html page.
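You can also discover the frame URL programmatically instead of reading it out of the page source by hand. A minimal sketch (the frameset markup below is a stand-in for what `http://www.vgmuseum.com/nes.htm` actually serves, not a verbatim copy):

```python
from bs4 import BeautifulSoup

# Stand-in for the frameset markup served by the top-level page.
html = """
<frameset cols="30%,70%">
  <frame name="menu" src="nes_b.html">
  <frame name="main" src="bushido.html">
</frameset>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the src of every frame/iframe; these are the pages a
# browser would load into each pane.
srcs = [frame.get("src") for frame in soup.find_all(["frame", "iframe"])]
print(srcs)  # ['nes_b.html', 'bushido.html']
```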

Here is a working solution that follows the links in the list, downloads the pages containing images, and then downloads the images into the images/ directory. I am using the requests module and the lxml parser teamed up with BeautifulSoup for faster HTML parsing:

from urllib.parse import urljoin

import os
import requests
from bs4 import BeautifulSoup

url = "http://www.vgmuseum.com/nes_b.html"


def download_image(session, url):
    print(url)
    os.makedirs("images", exist_ok=True)  # make sure the target directory exists
    local_filename = os.path.join("images", url.split('/')[-1])

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)


with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }
    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")

    for link in soup.select('ol > li > a[href*=images]'):
        response = session.get(urljoin(response.url, link.get('href')))
        for image in BeautifulSoup(response.content, "lxml").select("img[src]"):
            download_image(session, url=urljoin(response.url, image["src"]))
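If you want to save only actual image files (the scraped pages may reference non-image resources), you could filter URLs by extension before calling `download_image`. A small sketch; `is_image_url` is a hypothetical helper, not part of the code above:

```python
import os
from urllib.parse import urlparse

IMAGE_EXTS = {".png", ".gif", ".jpg", ".jpeg"}

def is_image_url(url):
    # Inspect only the path's extension, ignoring any query string.
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in IMAGE_EXTS

print(is_image_url("http://www.vgmuseum.com/images/nes/10yard1.gif"))   # True
print(is_image_url("http://www.vgmuseum.com/images/nes/10yard.html"))   # False
```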
alecxe

I used the url in @Dan's comment above for parsing.

Code:

import requests
from bs4 import BeautifulSoup

url = 'http://www.vgmuseum.com/nes_b.html'

page = requests.get(url).text
soup = BeautifulSoup(page, 'html.parser')
li = soup.find('ol')
for link in li.find_all('a'):
    href = link.get('href')
    if href and href != '#top':
        print(href)

Output:

images/nes/10yard.html
images/nes2/10.html
pics2/100man.html
images/nes/1942.html
images/nes2/1942.html
images/nes/1943.html
images/nes2/1943.html
pics7/1944.html
images/nes/1999.html
images/nes2/2600.html
images/nes2/3dbattles.html
images/nes2/3dblock.html
images/nes2/3in1.html
images/nes/4cardgames.html
pics2/4.html
images/nes/4wheeldrivebattle.html
images/nes/634.html
images/nes/720NES.html
images/nes/8eyes.html
images/nes2/8eyes.html
images/nes2/8eyesp.html
pics2/89.html
images/nes/01/blob.html
pics5/boy.html
images/03/a.html
images/03/aa.html
images/nes/abadox.html
images/03/abadoxf.html
images/03/abadoxj.html
images/03/abadoxp.html
images/03/abarenbou.html
images/03/aces.html
images/03/action52.html
images/03/actionin.html
images/03/adddragons.html
images/03/addheroes.html
images/03/addhillsfar.html
images/03/addpool.html
pics/addamsfamily.html
pics/addamsfamilypugsley.html
images/nes/01/adventureislandNES.html
images/nes/adventureisland2.html
images/nes/advisland3.html
pics/adventureisland4.html
images/03/ai4.html
images/nes/magickingdom.html
pics/bayou.html
images/03/bayou.html
images/03/captain.html
images/nes/adventuresofdinoriki.html
images/03/ice.html
images/nes/01/lolo1.html
images/03/lolo.html
images/nes/01/adventuresoflolo2.html
images/03/lolo2.html
images/nes/adventuresoflolo3.html
pics/radgravity.html
images/03/rad.html
images/nes/01/rockyandbullwinkle.html
images/nes/01/tomsawyer.html
images/03/afroman.html
images/03/afromario.html
pics/afterburner.html
pics2/afterburner2.html
images/03/ai.html
images/03/aigiina.html
images/nes/01/airfortress.html
images/03/air.html
images/03/airk.html
images/nes/01/airwolf.html
images/03/airwolfe.html
images/03/airwolfj.html
images/03/akagawa.html
images/nes/01/akira.html
images/03/akka.html
images/03/akuma.html
pics2/adensetsu.html
pics2/adracula.html
images/nes/01/akumajo.html
pics2/aspecial.html
pics/alunser.html
images/nes/01/alfred.html
images/03/alice.html
images/nes/01/alien3.html
images/nes/01/asyndrome.html
images/03/alien.html
images/03/all.html
images/nes/01/allpro.html
images/nes/01/allstarsoftball.html
images/nes/01/alphamission.html
pics2/altered.html
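Note that the hrefs printed above are relative to the page they were scraped from, so before requesting them you would join each one against the page URL with `urljoin`. A quick sketch using two of the paths above:

```python
from urllib.parse import urljoin

base = "http://www.vgmuseum.com/nes_b.html"

# Relative hrefs as printed by the loop above.
links = ["images/nes/10yard.html", "pics2/100man.html"]

# urljoin resolves each href against the page it came from.
full = [urljoin(base, href) for href in links]
print(full)
# ['http://www.vgmuseum.com/images/nes/10yard.html',
#  'http://www.vgmuseum.com/pics2/100man.html']
```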
Ali