I've not done much threading but was wondering if I could save images on a web page concurrently rather than one at a time.

Currently my code does the latter:

import re, urllib, urllib2
from bs4 import BeautifulSoup

count = 0
pageCount = 1

while pageCount <= 5:
    soup = BeautifulSoup(urllib2.urlopen("http://www.url.../%d" % pageCount))

    for link in soup.find_all("div", class_="photo"):
        pic = link.findAll('img')
        url = re.search(r"(?P<url>https?://[^\s]+\.(?:jpe?g))", str(pic)).group("url")
        count += 1
        urllib.urlretrieve(url, 'C:/Desktop/images/pics%s.jpg' % count)
    pageCount += 1

I was thinking that this process could be sped up by adopting a multi-threaded approach, but I'm unsure how.

Thanks

user94628
you may check this question (it uses gevent): http://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library or you may consider pooling connections (http://stackoverflow.com/questions/2009243/how-to-speed-up-pythons-urllib2-when-doing-multiple-requests) – furins Jan 24 '14 at 14:09

4 Answers


Scrapy downloads in parallel and has a ready-to-use image download pipeline (ImagesPipeline).
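
A minimal sketch of a spider using that pipeline in recent Scrapy versions (the URL pattern and the div.photo img selector come from the question and are placeholders for the real site):

import scrapy

class PhotoSpider(scrapy.Spider):
    name = "photos"
    start_urls = ["http://www.url.../%d" % n for n in range(1, 6)]
    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.images.ImagesPipeline": 1},
        "IMAGES_STORE": "C:/Desktop/images",
    }

    def parse(self, response):
        # ImagesPipeline fetches everything listed under "image_urls" concurrently
        yield {"image_urls": response.css("div.photo img::attr(src)").getall()}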

Guy Gavriely

Because of the GIL, multithreading in Python will only make the script faster at the points where it is blocked on I/O; CPU-bound work is unlikely to see any performance increase (if anything, it might get slower).

I have written scrapers for a multitude of different sites (some as large as 8+ TB of data). Python will struggle to reach full line rate in a single script; your best bet is to use a proper job queue (such as celery) and then run multiple workers to achieve concurrency.
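
As a rough illustration of that approach, something like the following (this assumes a Celery install with a Redis broker at the default local address; the task name and paths are purely illustrative):

# tasks.py
import urllib
from celery import Celery

app = Celery("downloads", broker="redis://localhost:6379/0")

@app.task
def fetch_image(url, dest):
    # each worker pulls URLs off the queue and downloads them independently,
    # so concurrency comes from running several workers rather than threads
    urllib.urlretrieve(url, dest)

# From the scraping script, enqueue instead of downloading inline:
#   fetch_image.delay(url, 'C:/Desktop/images/pics%s.jpg' % count)
# Then start workers with:
#   celery -A tasks worker --concurrency=8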

If you don't want celery, another hacky approach is to use subprocess to launch multiple instances of curl/wget/axel, block until they return, check the exit codes, check that the files exist, and so on. However, if your script does not exit cleanly, you will end up with orphaned processes (i.e. downloads continuing even after you kill the script). If you don't like the idea of subprocess, you can use something like eventlet or gevent, but you still won't achieve full line rate in a single script; you'll have to run multiple workers.
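
A minimal sketch of that subprocess pattern, assuming wget is on the PATH (the helper name, URLs and paths here are illustrative):

import os
import subprocess

def download_batch(urls, dest_dir):
    # launch one wget per URL, then block until every one has returned
    procs = [subprocess.Popen(["wget", "-q", "-P", dest_dir, u]) for u in urls]
    for proc, url in zip(procs, urls):
        code = proc.wait()
        filename = os.path.join(dest_dir, url.rsplit("/", 1)[-1])
        # check the exit code and that the file actually landed on disk
        if code != 0 or not os.path.exists(filename):
            print("download failed: %s" % url)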

Some sites have rate limiting, so using a job queue is usually a great way of getting around this (e.g. lots of EC2 instances with different IPs), with a number of workers on each to get maximum throughput.

Python is a perfectly fine tool for scraping huge amounts of data, you just have to do it correctly.

Also, pyquery is significantly faster than BeautifulSoup in many cases for processing results. At the very least, don't rely on BeautifulSoup to request the data for you: use something like python-requests to fetch the page, then pass the result into your parser (e.g. soup or pyquery).
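
For instance, a sketch of that split between fetching and parsing (the URL and selector mirror the question and are only placeholders):

import requests
from pyquery import PyQuery

resp = requests.get("http://www.url.../1")
doc = PyQuery(resp.text)
# pull every image URL out of the photo divs
image_urls = [img.get("src") for img in doc("div.photo img")]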

There are also scalability considerations if you plan on scraping/storing large amounts of data, such as bandwidth optimizations when processing jobs and downloading content. There are storage clusters which allow you to send a URL to their API, and they handle downloading the content for you. This saves wasting bandwidth by downloading then uploading the file into your backend - this can cut your bandwidth bill in half.

It's also worth mentioning that threading + BeautifulSoup has already been discussed here:

Urllib2 & BeautifulSoup : Nice couple but too slow - urllib3 & threads?

SleepyCal

If you're looking for a DIY solution, use a worker pool such as multiprocessing.Pool (or its thread-based counterpart):

I think you can map your loop body over the whole soup.find_all() result with a pool.
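
A minimal sketch of that mapping idea, reusing the question's URL pattern and regex (download_pic is a helper introduced here just for illustration):

import re, urllib, urllib2
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup

def download_pic(args):
    count, link = args
    pic = link.findAll('img')
    url = re.search(r"(?P<url>https?://[^\s]+\.(?:jpe?g))", str(pic)).group("url")
    urllib.urlretrieve(url, 'C:/Desktop/images/pics%s.jpg' % count)

links = []
for pageCount in range(1, 6):
    soup = BeautifulSoup(urllib2.urlopen("http://www.url.../%d" % pageCount))
    links.extend(soup.find_all("div", class_="photo"))

pool = ThreadPool(10)
# map() runs download_pic over every photo div concurrently and waits for all of them
pool.map(download_pic, enumerate(links, 1))
pool.close()
pool.join()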

Filip Malczak

Just use a pool, either a thread-based one or a multiprocessing one.

from multiprocessing.pool import ThreadPool

pool = ThreadPool(10)  # tweak the number of worker threads to taste

while pageCount <= 5:
    soup = BeautifulSoup(urllib2.urlopen("http://www.url.../%d" % pageCount))
    for link in soup.find_all("div", class_="photo"):
        pic = link.findAll('img')
        url = re.search(r"(?P<url>https?://[^\s]+\.(?:jpe?g))", str(pic)).group("url")
        count += 1
        # hand the download off to the pool instead of blocking on it
        pool.apply_async(urllib.urlretrieve, (url, 'C:/Desktop/images/pics%s.jpg' % count))
    pageCount += 1

# wait for all queued downloads to finish
pool.close()
pool.join()
vartec