The page you are working with consists of `iframe`s, which are basically a way of including one page inside another. Browsers understand how `iframe`s work and will download the embedded pages and display them in the browser window. `urllib2`, though, is not a browser and cannot do that. You need to explore where the list of links is located, i.e. in which `iframe`, and then follow the URL that this `iframe`'s content comes from. In your case, the list of links on the left comes from the http://www.vgmuseum.com/nes_b.html page.
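If you want to find that out programmatically rather than by inspecting the page by hand, a minimal sketch could look like the following. The framing page URL here is a hypothetical placeholder; substitute the page you were originally requesting:

```python
import requests
from bs4 import BeautifulSoup

# hypothetical framing page URL - replace with the page you were originally scraping
framing_url = "http://www.vgmuseum.com/nes.htm"

response = requests.get(framing_url)
soup = BeautifulSoup(response.content, "lxml")

# print the source of every iframe/frame so you can see where the list of links really lives
for frame in soup.select("iframe[src], frame[src]"):
    print(frame["src"])
```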
Here is a working solution that follows the links in that list, downloads the pages containing images, and then downloads the images into the images/ directory. I am using the `requests` module and the `lxml` parser teamed up with `BeautifulSoup` for faster HTML parsing:
from urllib.parse import urljoin
import os

import requests
from bs4 import BeautifulSoup

url = "http://www.vgmuseum.com/nes_b.html"


def download_image(session, url):
    print(url)
    local_filename = os.path.join("images", url.split('/')[-1])

    r = session.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)


os.makedirs("images", exist_ok=True)  # make sure the target directory exists

with requests.Session() as session:
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'
    }

    response = session.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    base_url = response.url

    # follow every link in the ordered list that points to an "images" page
    for link in soup.select('ol > li > a[href*=images]'):
        page_response = session.get(urljoin(base_url, link.get('href')))

        # download every image referenced on the followed page
        for image in BeautifulSoup(page_response.content, "lxml").select("img[src]"):
            download_image(session, url=urljoin(page_response.url, image["src"]))
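A couple of notes on the design choices: `stream=True` together with `iter_content()` writes each image to disk in small chunks instead of loading it fully into memory, and reusing a single `requests.Session` gives you connection pooling plus a shared `User-Agent` header for every request.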