I wrote the following Python code to crawl the images from the website www.style.com:

 import urllib2, urllib, random, threading
 from bs4 import BeautifulSoup
 import sys
 reload(sys)
 sys.setdefaultencoding('utf-8')

 class Images(threading.Thread):
   def __init__(self, lock, src):
     threading.Thread.__init__(self)
     self.src = src
     self.lock = lock

   def run(self):
     self.lock.acquire()
     urllib.urlretrieve(self.src,'./img/'+str(random.choice(range(9999))))
     print self.src+'get'
     self.lock.release()

 def imgGreb():
   lock = threading.Lock()
   site_url = "http://www.style.com"
   html = urllib2.urlopen(site_url).read()
   soup = BeautifulSoup(html)
   img=soup.findAll(['img'])
   for i in img:
     print i.get('src')
     Images(lock, i.get('src')).start()

 if __name__ == '__main__':
   imgGreb()

But I got this error:

IOError: [Errno 2] No such file or directory: '/images/homepage-2013-october/header/logo.png'

How can it be solved?

Also, can this recursively find all the images on the website? I mean other images that are not on the homepage.

Thanks!

ProgramFOX
randomp

1 Answer

  1. You are passing the relative path, without the domain, when you try to retrieve the URL. Without a scheme and domain, urlretrieve treats the string as a local file path, which is why you get "No such file or directory".
  2. Some of the images are JavaScript based, so the src you scrape will be javascript:void(0);, which can never be fetched. I added the try/except to get around that error. Alternatively, you can detect whether the URL ends with jpg/gif/png or not. I will leave that work to you :)
  3. BTW, not all the images are included in the HTML of the page. Some of the pictures, e.g. Beautiful One, are loaded using JavaScript, so there is nothing we can do using urllib and BeautifulSoup only. If you really want to challenge yourself, you can try to learn Selenium, which is a more powerful tool.
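Points 1 and 2 can be sketched as a small helper that runs without network access. This is only an illustration, assuming Python 3 (where urljoin lives in urllib.parse) rather than the Python 2 used in the question. The function name resolve_image_urls, the extension set, and the cdn.example.com URL are made up for the example.

```python
import posixpath
from urllib.parse import urljoin, urlparse

# Hypothetical whitelist of image extensions; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def resolve_image_urls(page_url, srcs):
    """Turn raw src attribute values into absolute http(s) image URLs."""
    resolved = []
    for src in srcs:
        if not src:
            continue  # <img> tag without a src attribute
        # urljoin prepends the domain to relative paths and leaves
        # already-absolute URLs untouched.
        absolute = urljoin(page_url, src)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skips javascript:void(0); and similar pseudo-URLs
        # Look at the path only, so a query string does not hide the extension.
        ext = posixpath.splitext(parts.path)[1].lower()
        if ext in IMAGE_EXTENSIONS:
            resolved.append(absolute)
    return resolved


example = resolve_image_urls(
    "http://www.style.com",
    [
        "/images/homepage-2013-october/header/logo.png",
        "javascript:void(0);",
        None,
        "http://cdn.example.com/a/b.JPG?x=1",  # made-up absolute URL
    ],
)
print(example)
```

Filtering on the URL scheme means the javascript: entries are dropped before any download is attempted, so they no longer need to be swallowed by a try/except.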

Try the code below directly:

import urllib2
from urllib import urlretrieve
from urlparse import urljoin
from bs4 import BeautifulSoup


def imgGreb():
    site_url = "http://www.style.com"
    html = urllib2.urlopen(site_url).read()
    soup = BeautifulSoup(html)
    for i in soup.findAll('img'):
        src = i.get('src')
        if not src:
            # skip <img> tags that have no src attribute
            continue
        try:
            # build the complete URL from the domain and the (possibly relative) URL you scraped
            url = urljoin(site_url, src)
            # use the last path segment as the file name
            name = "result_" + url.split('/')[-1]
            # detect whether this is a type of picture you want
            ext = name.split('.')[-1].lower()
            if ext in ['jpg', 'png', 'gif']:
                # if so, retrieve the picture
                urlretrieve(url, name)
        except Exception:
            # skip anything that cannot be downloaded
            pass

if __name__ == '__main__':
    imgGreb()
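As for the second part of the question (images beyond the homepage), one possible approach is a breadth-first crawl that follows links on the same host and collects image URLs along the way. The sketch below is an assumption-laden illustration, not a tested crawler for style.com: it targets Python 3, swaps BeautifulSoup for the standard library's html.parser so it has no dependencies, and all names (crawl_images, fetch_html, fake_site, example.test) are hypothetical. It still cannot see images that are loaded by JavaScript.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndImageParser(HTMLParser):
    """Collects href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def fetch_html(url):
    """Real fetcher; not used by the offline demo below."""
    return urlopen(url).read().decode("utf-8", "replace")


def crawl_images(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of same-host pages; returns a set of image URLs."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    images = set()
    pages = 0
    while queue and pages < max_pages:
        page_url = queue.popleft()
        pages += 1
        try:
            html = fetch(page_url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        parser = LinkAndImageParser()
        parser.feed(html)
        for src in parser.images:
            absolute = urljoin(page_url, src)
            if urlparse(absolute).scheme in ("http", "https"):
                images.add(absolute)
        for href in parser.links:
            link = urldefrag(urljoin(page_url, href))[0]  # drop #fragment
            parts = urlparse(link)
            same_host = parts.scheme in ("http", "https") and parts.netloc == host
            if same_host and link not in seen:
                seen.add(link)
                queue.append(link)
    return images


# Offline demo with a made-up two-page site instead of real HTTP requests.
fake_site = {
    "http://example.test/": '<a href="/a.html">A</a><img src="/img/home.png">',
    "http://example.test/a.html": (
        '<a href="/">home</a><a href="http://other.test/">x</a>'
        '<img src="pics/a.jpg">'
    ),
}


def fake_fetch(url):
    return fake_site.get(url, "")


found = crawl_images("http://example.test/", fetch=fake_fetch)
print(sorted(found))
```

The seen set prevents the crawler from looping between pages that link to each other, and max_pages is a safety cap so a large site does not turn into an unbounded crawl. Calling crawl_images("http://www.style.com") would use the real fetcher; the resulting URLs could then be passed to a download function.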
B.Mr.W.
  • it will generate errors: InvalidURL: nonnumeric port: 'void(0);' – randomp Nov 03 '13 at 17:33
  • @randomp I temporarily removed your OOP part because it is confusing at the beginning. Maybe you can give it a try and see whether this code works. If so, you can reimplement it using OOP. – B.Mr.W. Nov 03 '13 at 17:58
  • @randomp Is it working for you? If so, please mark this question as answered and it will be helpful to other people – B.Mr.W. Nov 03 '13 at 18:22